Abstract

Inferring the causal structure that links n observables is usually based upon detecting statistical 
dependences and choosing simple graphs that make the joint measure Markovian. Here we argue 
why causal inference is also possible when only single observations are present. 

We develop a theory of how to generate causal graphs that explain similarities between single
objects. To this end, we replace the notion of conditional stochastic independence in the causal
Markov condition with the vanishing of conditional algorithmic mutual information and describe
the corresponding causal inference rules.

We explain why a consistent reformulation of causal inference in terms of algorithmic complex- 
ity implies a new inference principle that takes into account also the complexity of conditional 
probability densities, making it possible to select among Markov equivalent causal graphs. This 
insight provides a theoretical foundation of a heuristic principle proposed in earlier work. 

We also discuss how to replace Kolmogorov complexity with decidable complexity criteria. This 
can be seen as an algorithmic analog of replacing the empirically undecidable question of statistical 
independence with practical independence tests that are based on implicit or explicit assumptions 
on the underlying distribution. 
1 Introduction to causal inference from statistical data

Causal inference from statistical data has attracted increasing interest in the past decade. In
contrast to traditional statistics, where statistical dependences are only taken to prove that some
kind of relation between random variables exists, causal inference methods in machine learning are
explicitly designed to generate hypotheses on causal directions automatically based upon statistical
independence tests [1], [2]. The crucial assumption connecting statistics with causality is the causal
Markov condition, explained below after we have introduced some notation and terminology.

We denote random variables by capitals and their values by the corresponding lowercase letters.
Let $X_1, \dots, X_n$ be random variables and $G$ be a directed acyclic graph (DAG) representing the
causal structure, where an arrow from node $X_i$ to node $X_j$ indicates a direct causal effect. Here
the term direct is understood with respect to the chosen set of variables, in the sense that the
information flow between the two variables considered does not pass through one or more of
the other variables as intermediate nodes. We will next briefly rephrase the postulates that are
required in the statistical theory of inferred causation [2], [1].

1.1 Causal Markov condition 

When we consider the causal structure that links $n$ random variables $V := \{X_1, \dots, X_n\}$ we will
implicitly assume that $V$ is causally sufficient in the sense that all common causes of two variables
in $V$ are also in $V$. Then a causal hypothesis $G$ is only acceptable as potential causal structure
if the joint distribution $P$ of $X_1, \dots, X_n$ satisfies the Markov condition with respect to $G$. There
are several formulations of the Markov condition that are known to coincide under some technical
condition (see Lemma 1). We will first introduce the following version, which is sometimes referred
to as the parental or the local Markov condition [5].
To this end, we introduce the following notation. $PA_j$ is the set of parents of $X_j$ and $ND_j$ the
set of non-descendants of $X_j$ except itself. If $S, T, R$ are sets of random variables, $S \perp\!\!\!\perp T \mid R$
means that $S$ is statistically independent of $T$, given $R$.

Postulate 1 (statistical causal Markov condition, local)

If a directed acyclic graph $G$ formalizes the causal structure among the random variables
$X_1, \dots, X_n$, then

$$X_j \perp\!\!\!\perp ND_j \mid PA_j$$

for all $j = 1, \dots, n$.

We call this postulate the statistical causal Markov condition because we will later introduce an
algorithmic version. The fact that conditional irrelevance occurs not only in the context of statistical
dependences has been emphasized in the literature (e.g. [HQ]) in the context of describing abstract
properties (like the semi-graphoid axioms) of the relation $\cdot \perp\!\!\!\perp \cdot \mid \cdot$. We will therefore state
the causal Markov condition also in an abstract form that does not refer to any specific notion of
conditional informational irrelevance:

Postulate 2 (abstract causal Markov condition, local) 

Given all the direct causes of an observable O, its non-effects provide no additional information 
on O. 

Here, observables denote something in the real world that can be observed and the observation 
of which can be formalized in terms of a mathematical language. In this paper, observables will 
either be random variables (formalizing statistical quantities) or they will be strings (formaliz- 
ing the description of objects). Accordingly, information will be statistical or algorithmic mutual 
information, respectively. 

The importance of the causal Markov condition lies in the fact that it links causal terms like 
"direct causes" and "non-effects" to informational relevance of observables. The local Markov 
condition is rather intuitive because it echoes the fact that the information flows from direct causes 
to their effect and every dependence between a node and its non-descendants involves the direct 
causes. However, the independences postulated by the local Markov condition imply additional 
independences. It is therefore hard to decide whether an independence must hold for a Markovian 
distribution or not, solely on the basis of the local formulation. In contrast, the global Markov 
condition makes the complete set of independences obvious. To state it we first have to introduce 
the following graph-theoretical concept. 

Definition 1 (d-separation) 

A path $p$ in a DAG is said to be d-separated (or blocked) by a set of nodes $Z$ if and only if

1. $p$ contains a chain $i \to m \to j$ or a fork $i \leftarrow m \to j$ such that the middle node $m$ is in $Z$, or

2. $p$ contains an inverted fork (or collider) $i \to m \leftarrow j$ such that the middle node $m$ is not in $Z$
and such that no descendant of $m$ is in $Z$.

A set $Z$ is said to d-separate $X$ from $Y$ if and only if $Z$ blocks every (possibly undirected) path from
a node in $X$ to a node in $Y$.
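
Definition 1 can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: instead of enumerating paths it uses the well-known equivalent criterion based on the moralized ancestral graph. The function names and the `parents` dictionary format (node mapped to a list of its parents) are assumptions made for this illustration, and the three node sets are assumed to be disjoint.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, xs, ys, zs):
    """True if the set zs d-separates xs from ys.

    Steps: restrict the DAG to the ancestral set of xs, ys and zs,
    moralize it (connect co-parents, drop directions), delete zs,
    and test whether xs and ys are disconnected.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    adj = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, ()))
        for p in ps:                      # undirected parent-child edges
            adj[v].add(p)
            adj[p].add(v)
        for i, p in enumerate(ps):        # "marry" all pairs of parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    frontier = deque(xs - zs)
    reached = set(frontier)
    while frontier:                       # breadth-first search avoiding zs
        v = frontier.popleft()
        if v in ys:
            return False
        for w in adj[v]:
            if w not in zs and w not in reached:
                reached.add(w)
                frontier.append(w)
    return True


# Hypothetical example graphs: a chain a -> m -> b and a collider a -> m <- b
# whose middle node has a child d.
CHAIN = {"m": ["a"], "b": ["m"]}
COLLIDER = {"m": ["a", "b"], "d": ["m"]}
```

In the chain, conditioning on the middle node blocks the path; in the collider, the path is blocked unless the middle node or one of its descendants is conditioned on.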

The following Lemma shows that d-separation is the correct condition for deciding whether an
independence is implied by the local Markov condition ([4], Theorem 3.27).

Lemma 1 (equivalent Markov conditions) 

Let $P(X_1, \dots, X_n)$ have a density $P(x_1, \dots, x_n)$ with respect to a product measure. Then the
following three statements are equivalent:

I. Recursive form: $P$ admits the factorization

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid pa_j), \qquad (1)$$

where $P(\cdot \mid pa_j)$ is shorthand for the conditional probability density, given the values of all
parents of $X_j$.

II. Local (or parental) Markov condition: for every node $X_j$ we have

$$X_j \perp\!\!\!\perp ND_j \mid PA_j,$$

i.e., it is conditionally independent of its non-descendants (except itself), given its parents.

III. Global Markov condition:

$$S \perp\!\!\!\perp T \mid R$$

for all three sets $S, T, R$ of nodes for which $S$ and $T$ are d-separated by $R$.

Moreover, the local and the global Markov condition are equivalent even if $P$ does not have a
density with respect to a product measure.

The conditional densities $P(x_j \mid pa_j)$ are also called the Markov kernels relative to the
hypothetical causal graph $G$. It is important to note that every choice of Markov kernels defines a
Markovian density $P$, i.e., the Markov kernels define exactly the set of free parameters remaining
after the causal structure has been specified.
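
As a small numerical illustration of the recursive form (1), the following Python sketch builds a joint distribution for the chain $X_1 \to X_2 \to X_3$ from freely chosen Markov kernels and checks the independence $X_1 \perp\!\!\!\perp X_3 \mid X_2$ that the Markov condition demands. The kernel values and all function names are hypothetical and serve only this example.

```python
from itertools import product

# Hypothetical Markov kernels for binary variables on the chain X1 -> X2 -> X3.
P_X1 = [0.3, 0.7]                            # P(x1)
P_X2_GIVEN_X1 = [[0.9, 0.1], [0.2, 0.8]]     # P(x2 | x1), indexed [x1][x2]
P_X3_GIVEN_X2 = [[0.6, 0.4], [0.25, 0.75]]   # P(x3 | x2), indexed [x2][x3]


def joint_table():
    """Joint distribution obtained as the product of the three kernels."""
    return {
        (x1, x2, x3): P_X1[x1] * P_X2_GIVEN_X1[x1][x2] * P_X3_GIVEN_X2[x2][x3]
        for x1, x2, x3 in product((0, 1), repeat=3)
    }


def ci_violation(table):
    """Largest |P(x1, x3 | x2) - P(x1 | x2) P(x3 | x2)| over all values."""
    worst = 0.0
    for x2 in (0, 1):
        p_x2 = sum(p for (_, b, _), p in table.items() if b == x2)
        for x1, x3 in product((0, 1), repeat=2):
            p_13 = table[(x1, x2, x3)] / p_x2
            p_1 = sum(table[(x1, x2, c)] for c in (0, 1)) / p_x2
            p_3 = sum(table[(a, x2, x3)] for a in (0, 1)) / p_x2
            worst = max(worst, abs(p_13 - p_1 * p_3))
    return worst
```

Changing any of the kernel entries (while keeping each row normalized) yields another Markovian distribution, which is the sense in which the kernels are the free parameters.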

To select graphs among all those that render P Markovian, we also need an additional postulate: 

Postulate 3 (causal faithfulness) 

Among all graphs $G$ for which $P$ is Markovian, prefer the ones for which all the observed conditional
independences in the joint measure $P(X_1, \dots, X_n)$ are imposed by the Markov condition.

The idea is that the set of observed independences is typical for the causal structure under
consideration rather than being the result of specific choices of the Markov kernels. This becomes
even more intuitive when we restrict our attention to random variables with finite value set and
observe that the values $P(x_j \mid pa_j)$ then define a natural parameterization of the set of Markovian
distributions in a finite-dimensional space. The non-faithful distributions form a submanifold of
lower dimension, i.e., a set of Lebesgue measure zero [5]. They therefore almost surely do not
occur if we assume that "nature chooses" the Markov kernels for the different nodes independently
according to some density on the parameter space.

The above "zero Lebesgue measure argument" is close to the spirit of Bayesian approaches [5] , 
where priors on the set of Markov kernels are specified for every possible hypothetical causal DAG 
and causal inference is performed by maximizing posterior probabilities for hypothetical DAGs, 
given the observed data. This procedure leads to an implicit preference of faithful structures in the 
infinite sampling limit given some natural conditions for the priors on the parameter space. The 
assumption that "nature chooses Markov kernels independently" , which is also part of the Bayesian 
approach, will turn out to be closely related to the algorithmic Markov condition postulated in this 
paper. 

We now discuss the justification of the statistical causal Markov condition because we will later 
justify the algorithmic Markov condition in a similar way. To this end, we introduce functional 
models pQ: 
Postulate 4 (functional model of causality)

If a directed acyclic graph $G$ formalizes the causal relation between the random variables
$X_1, \dots, X_n$, then every $X_j$ can be written as a deterministic function of $PA_j$ and a noise
variable $N_j$,

$$X_j = f_j(PA_j, N_j),$$

where all $N_j$ are jointly independent.

Then we have ([1], Theorem 1.4.1):

Lemma 2 (Markov condition in functional models) 

Every joint distribution $P(X_1, \dots, X_n)$ generated according to the functional model in Postulate 4
satisfies the local and the global Markov condition relative to $G$.

We rephrase the proof in [1] because our proof for the algorithmic version will rely on the same
idea.

Proof of Lemma 2: Extend $G$ to a graph $\tilde{G}$ with nodes $X_1, \dots, X_n, N_1, \dots, N_n$ that additionally
contains an arrow from each $N_j$ to $X_j$. The given joint distribution of noise variables induces a
joint distribution

$$P(X_1, \dots, X_n, N_1, \dots, N_n)$$

that satisfies the local Markov condition with respect to $\tilde{G}$: first, every $X_j$ is completely determined
by its parents, making the condition trivial. Second, every $N_j$ is parentless and thus we have to
check that it is (unconditionally) independent of its non-descendants. The latter are deterministic
functions of $\{N_1, \dots, N_n\} \setminus \{N_j\}$. Hence the independence follows from the joint independence of
all $N_i$.

By Lemma 1, $P$ is also globally Markovian w.r.t. $\tilde{G}$. Then we observe that $ND_j$ and $X_j$ are
d-separated in $\tilde{G}$ by $PA_j$ (where the parents and non-descendants are defined with respect to $G$).
Hence $P$ satisfies the local Markov condition w.r.t. $G$ and hence also the global Markov condition. □
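
Lemma 2 can be illustrated on a toy functional model. The sketch below is only an example under stated assumptions: the graph is the collider $X_1 \to X_3 \leftarrow X_2$, the mechanisms and noise probabilities are invented for this illustration, and the joint distribution is obtained by exact enumeration of the independent noise variables rather than by sampling.

```python
from itertools import product

# Hypothetical noise distributions: P(N_j = 1) for three independent binary noises.
P_NOISE_ONE = (0.5, 0.5, 0.1)


def mechanisms(n1, n2, n3):
    """Functional model X_j = f_j(PA_j, N_j) for the collider X1 -> X3 <- X2."""
    x1 = n1                  # X1 has no parents
    x2 = n2                  # X2 has no parents
    x3 = (x1 & x2) ^ n3      # X3 is a noisy AND of its parents
    return x1, x2, x3


def joint_distribution():
    """Exact joint distribution of (X1, X2, X3) induced by the noise variables."""
    table = {}
    for noise in product((0, 1), repeat=3):
        weight = 1.0
        for value, p_one in zip(noise, P_NOISE_ONE):
            weight *= p_one if value else 1.0 - p_one
        x = mechanisms(*noise)
        table[x] = table.get(x, 0.0) + weight
    return table


def dependence_x1_x2(table, x3=None):
    """Largest |P(x1, x2 | x3) - P(x1 | x3) P(x2 | x3)|; unconditional if x3 is None."""
    rows = {k: p for k, p in table.items() if x3 is None or k[2] == x3}
    total = sum(rows.values())
    worst = 0.0
    for a, b in product((0, 1), repeat=2):
        p_ab = sum(p for k, p in rows.items() if k[0] == a and k[1] == b) / total
        p_a = sum(p for k, p in rows.items() if k[0] == a) / total
        p_b = sum(p for k, p in rows.items() if k[1] == b) / total
        worst = max(worst, abs(p_ab - p_a * p_b))
    return worst
```

The local Markov condition for $X_1$ only requires the unconditional independence of $X_1$ and $X_2$, which holds here. Conditioning on the common effect $X_3$ makes the two causes dependent, in agreement with the collider rule of d-separation.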

Functional models formalize the idea that the outcome of an experiment is completely deter- 
mined by the values of all relevant parameters where the only uncertainty stems from the fact that 
some of these parameters are hidden. Even though this kind of determinism is in contrast with the 
commonly accepted interpretation of quantum mechanics [7] , we still consider functional models as 
a helpful framework for discussing causality in real life since quantum mechanical laws refer mainly 
to phenomena in micro-physics. 

Causal inference using the Markov condition and the faithfulness assumption has been implemented
in causal learning algorithms [2]. The following fundamental limitations of these methods
deserve further attention:

1. Markov equivalence: There are only few cases where the inference rules provide unique causal 
graphs. Often one ends up with a large class of Markov equivalent graphs, i.e., graphs that 
entail the same set of independences. For this reason, additional inference rules are desirable. 

2. Dependence on i.i.d. sampling: The whole setting of causal inference relies on the ability to
sample repeatedly and independently from the same joint distribution $P(X_1, \dots, X_n)$. As opposed
to this assumption, causal inference in real life also deals with probability distributions
that change in time, and often one infers causal relations among single observations without
referring to statistics at all.

The idea of this paper is to develop a theory of probability-free causal inference that helps to
construct causal hypotheses based on similarities of single objects. Here, similarities will be defined
by comparing the length of the shortest description of single objects to the length of their shortest
joint description. Despite the analogy to causal inference from statistical data (which is due to
known analogies between statistical and algorithmic information theory), our theory also implies
new statistical inference rules. In other words, our approach to addressing limitation 2 also yields
new methods to address limitation 1.

The paper is structured as follows. In the remaining part of this Section, i.e., Subsection 1.2, we
describe recent approaches from the literature to causal inference from statistical data that address
problem 1 above. In Section 2 we develop the general theory on inferring causal relations among
individual objects based on algorithmic information. This framework appears, at first sight, as a
straightforward adaptation of the statistical framework (using well-known correspondences between
statistical and algorithmic information theory). However, Section 3 describes that this implies novel
causal inference rules for statistical inference because non-statistical algorithmic dependences can
even occur in data that were obtained from statistical sampling. In Section 4 we describe how to
replace causal inference rules based on the uncomputable algorithmic information with decidable
criteria that are still motivated by the uncomputable idealization.

The table in fig. 1 summarizes the analogies between the theory of statistical and the theory of
algorithmic causal inference described in this paper. The differences, however, which are the main
subject of Sections 3 and 4, can hardly be represented in the table.



1.2 Seeking for new statistical inference rules 

In [8] and [9] we have proposed causal inference rules that are based on the idea that the factorization
of P(cause, effect) into P(effect|cause) and P(cause) typically leads to simpler terms than the
"artificial" factorization into P(effect)P(cause|effect). The generalization of this principle reads:
Among all graphs $G$ that render $P$ Markovian, prefer the one for which the decomposition in eq. (1)
yields the simplest Markov kernels. We have called this vague idea the "principle of plausible
Markov kernels".

Before we describe several options to define simplicity, we give a simple example to illustrate
the idea. Assume we have observed that a binary variable $X$ (with values $x = -1, 1$) and a
continuous variable $Y$ (with values in $\mathbb{R}$) are distributed according to a mixture of two Gaussians
(see fig. 2). Since this will simplify the further discussion, let us assume that the two components
are equally weighted, i.e.,

$$P(x, y) = \frac{1}{2} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y - \mu - x\lambda)^2}{2\sigma^2}},$$

where $\lambda$ determines the shift of the mean caused by switching between $x = 1$ and $x = -1$, $\sigma$ is the
common width of the two Gaussians, and $\mu$ is the midpoint between their means.
The marginal $P(Y)$ is given by

$$P(y) = \frac{1}{2} \frac{1}{\sqrt{2\pi}\,\sigma} \left( e^{-\frac{(y - \mu + \lambda)^2}{2\sigma^2}} + e^{-\frac{(y - \mu - \lambda)^2}{2\sigma^2}} \right). \qquad (2)$$

One will prefer the causal structure $X \to Y$ compared to $Y \to X$ because the former explains in a
natural way why $P(Y)$ is bimodal: the effect of $X$ on $Y$ is simply to shift the Gaussian distribution
by $2\lambda$. In the latter model the bimodality of $P(Y)$ remains unexplained. To prefer one causal model
to another one because the corresponding conditionals are simpler seems to be a natural application
of Occam's Razor. However, Section 3 will show that such an inference rule also follows from the
theory developed in the present paper when simplicity is meant in the sense of low Kolmogorov
complexity. In the remaining part of this section we will sketch some approaches to implement the
"principle of plausible Markov kernels" in practical applications.

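The two factorizations of the Gaussian mixture example can be compared numerically. The Python sketch below assumes the mixture has component means $\mu \pm \lambda$ and common width $\sigma$; the parameter values and function names are hypothetical. It shows that the backward conditional $P(x \mid y)$ is a sigmoid in $y$ whose slope is tied to the parameters of the bimodal marginal $P(y)$, whereas the forward conditional $P(y \mid x)$ is just a shifted Gaussian.

```python
from math import exp, pi, sqrt

# Hypothetical parameter values chosen only for illustration.
MU, LAM, SIGMA = 0.0, 2.0, 1.0


def gaussian(y, mean, sigma):
    """Density of a normal distribution with the given mean and width."""
    return exp(-((y - mean) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def p_joint(x, y):
    """P(x, y) for x in {-1, +1}: equal weights, component mean MU + x * LAM."""
    return 0.5 * gaussian(y, MU + x * LAM, SIGMA)


def p_y(y):
    """Marginal density of Y, a mixture of two Gaussians."""
    return p_joint(-1, y) + p_joint(1, y)


def p_x_given_y(x, y):
    """Backward conditional P(x | y) obtained from Bayes' rule."""
    return p_joint(x, y) / p_y(y)


def logistic_form(x, y):
    """Closed form of P(x | y): a sigmoid in y with slope 2 * LAM / SIGMA**2."""
    return 1.0 / (1.0 + exp(-2.0 * x * LAM * (y - MU) / SIGMA ** 2))
```

With these values the marginal has a dip at `MU` between its two modes near `MU - LAM` and `MU + LAM`, which the model $X \to Y$ explains by the shift alone.
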
In [8] we have defined a family of "plausible Markov kernels" by conditionals $P(X_j \mid PA_j)$ that
are second order exponential models, i.e., $\log P(x_j \mid pa_j)$ is a polynomial of order two in the variables
$\{X_j\} \cup \{PA_j\}$ up to some additive partition function (for normalization) that depends only on the
variables $PA_j$. For every hypothetical causal graph, one thus obtains a family of "plausible joint
distributions $P(X_1, \dots, X_n)$" that are products of the plausible Markov kernels. Then we prefer
the causal direction for which the plausible joint distributions provide the best fit for the given
observations.

Figure 1: Analogies between statistical and algorithmic causal inference

Figure 2: Observed joint distribution of X and Y consisting of two Gaussians of equal width shifted against
each other.

In [9] we have proposed the following principle for causal inference: Given a joint distribution
of the random variables $X_1, \dots, X_n$, prefer a causal structure for which

$$\sum_{j=1}^{n} C\big(P(X_j \mid PA_j)\big) \qquad (3)$$

is minimal, where $C$ is some complexity measure on conditional probability densities.

There is also another recent proposal for new inference rules that refers to a related simplicity
assumption, though formally quite different from the ones above. The authors of [10] observe that
there are joint distributions of $X_1, \dots, X_n$ that can be explained by a linear model with additive
non-Gaussian noise for one causal direction but require non-linear causal influence for the other
causal directions. For real data they prefer the causal graph for which the observations are closer
to the linear model.

To justify the belief that conditionals that correspond to the true causal direction tend to be 
simpler than non-causal conditionals (which is common to all the approaches above) is one of the 
main goals of this paper. 

2 Inferring causal relations among individual objects 

It has been emphasized [1] that the application of causal inference principles often benefits from the
non-determinism of causal relations between the observed random variables. In contrast, human
learning in real life is often about quite deterministic relations. Apart from that, the most important
difference between human causal learning and the inference rules in [2], [1] is that the former is also
about causal relations among single objects and does not necessarily require sampling. Assume,
for instance, that the comparison of two texts shows similarities (see e.g. [11]) such that the author
of the text that appeared later is accused of having copied it from the other one, or both are accused
of having copied from a third one. The statement that the texts are similar could be based on a
statistical analysis of the occurrences of certain words or letter sequences. However, such
simple statistical tests can fail in both directions: In Subsection 2.2 (before Theorem 3) we will
discuss an example showing that they can erroneously infer causal relations even though they do
not exist. This is because parts that are common to both objects, e.g., the two texts, are only
suitable to prove a causal link if they are not "too straightforward" to come up with.

On the other hand, causal relations can generate similarities between texts for which every
efficient statistical analysis is believed to fail. We will describe an idea from cryptography to
show this. A cryptosystem is called ROR-CCA-secure (Real or Random under Chosen Ciphertext
Attacks) if there is no efficient method to decide whether a text is random or the encrypted version
of some known text without knowing the key [12]. Given that there are ROR-CCA-secure schemes
(which is unknown but believed by cryptographers), we have a causal relation leading to similarities
that are not detected by any kind of simple counting statistics. However, once an attacker has found
the key (maybe by exhaustive search), he recognizes similarities between the encrypted text and the
plain text and infers a causal relation. This already suggests two things: (1) Detecting similarities
involves searching over potential rules for how properties of one object can be algorithmically derived
from the structure of the other. (2) It is likely that inferring causal relations therefore relies on
computationally infeasible decisions (if computable at all) on whether two objects have information
in common or not.

2.1 Algorithmic mutual information 

We will now describe how the information one object provides about the other can be measured
in terms of Kolmogorov complexity. We start with some notation and terminology. Below, strings
will always be binary strings since every description given in terms of a different alphabet can
be converted into a binary word. The set of binary strings of arbitrary length will be denoted
by $\{0, 1\}^*$. Recall that the Kolmogorov complexity $K(s)$ of a string $s \in \{0, 1\}^*$ is defined as the
length of the shortest program that generates $s$ using a previously defined universal Turing machine
[13, 14, 15, 16, 17, 18]. The conditional Kolmogorov complexity $K(t \mid s)$ [18] of a string $t$ given
another string $s$ is the length of the shortest program that can generate $t$ from $s$. In order to keep
our notation simple we use $K(x, y)$ to refer to the complexity of the concatenation of $x, y$.

We will mostly have equations that are valid only up to additive constant terms in the sense
that the difference between both sides does not depend on the strings involved in the equation (but
it may depend on the Turing machines they refer to). To indicate such constants we denote the
corresponding equality by $\stackrel{+}{=}$ and likewise for inequalities. In this context it is important to note
that the number $n$ of nodes of the causal graph is considered to be a constant. Moreover, for every
string $s$ we define $s^*$ as its shortest description. If the latter is not unique, we consider the first
one in a lexicographic order. It is necessary to distinguish between $K(\cdot \mid s)$ and $K(\cdot \mid s^*)$. This is
because there is a trivial algorithmic method to generate $s$ from $s^*$ (just apply the Turing machine
to $s^*$), but there is no algorithm of length $O(1)$ that computes the shortest description $s^*$ from a
general input $s$. One can show [19] that $s^* \equiv (s, K(s))$. Here, the equivalence symbol $\equiv$ means
that both sides can be obtained from each other by $O(1)$ programs. The following equation for the
joint algorithmic information of two strings $x, y$ will be useful [20]:

$$K(x, y) \stackrel{+}{=} K(x) + K(y \mid x^*) \stackrel{+}{=} K(x) + K(y \mid x, K(x)). \qquad (4)$$

The conditional version reads [20]:

$$K(x, y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid x, K(x \mid z), z). \qquad (5)$$

The most important notion in this paper will be the algorithmic mutual information measuring the
amount of algorithmic information that two objects have in common. Following [21] we define:
Definition 2 (algorithmic mutual information)

Let $x, y$ be two strings. Then the algorithmic mutual information of $x, y$ is

$$I(x : y) := K(y) - K(y \mid x^*).$$

The mutual information is the number of bits that can be saved in the description of $y$ when
the shortest description of $x$ is already known. The fact that one uses $x^*$ instead of $x$ ensures that
it coincides with the symmetric expression [21]:

Lemma 3 (symmetric version of algorithmic mutual information)

For two strings $x, y$ we have

$$I(x : y) \stackrel{+}{=} K(x) + K(y) - K(x, y).$$
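
Since $K$ is uncomputable, any numerical illustration of Lemma 3 has to replace it with a computable proxy. The Python sketch below uses the compressed length produced by the standard `zlib` module as a crude upper bound on $K$; this substitution, the function names, and the test strings are assumptions of this example and not the method of the paper.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data, a rough stand-in for K(data)."""
    return 8 * len(zlib.compress(data, 9))


def mutual_information_estimate(x: bytes, y: bytes) -> int:
    """Compression analog of the symmetric expression K(x) + K(y) - K(x, y)."""
    return compressed_bits(x) + compressed_bits(y) - compressed_bits(x + y)


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible pseudo-random test string of n bytes (illustration only)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))
```

For example, `mutual_information_estimate(x, x)` is large for an incompressible `x`, because the second copy is almost free to describe once the first is known, whereas two independently generated random strings share almost nothing. The estimate inherits all limitations of the compressor, e.g. its finite window size, and can even be slightly negative.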

In the following sections, non-vanishing mutual information will be taken as an indicator for
causal relations, but more detailed information on the causal structure will be inferred from
conditional mutual information. This is in contrast to approaches from the literature to measure
similarity versus differences of single objects that we briefly review now. To measure differences
between single objects, e.g. pictures [22], [23], one defines the information distance $E(x, y)$ between
the two corresponding strings as the length of the shortest program that computes $x$ from $y$ and $y$
from $x$. It can be shown [22] that

$$E(x, y) \stackrel{\log}{=} \max\{K(x \mid y), K(y \mid x)\},$$

where $\stackrel{\log}{=}$ means equality up to a logarithmic term. However, whether $E(x, y)$ is small or large is
not an appropriate condition for the existence and the strength of a causal link. Complex objects
can have much information in common even though their distance is large. In order to obtain a
measure that relates the amount of information that is disjoint for the two strings to the amount
they share, Li et al. [23] and Bennett et al. [11] use the "normalized distance measure"

$$d_s(x, y) := \frac{K(x \mid y) + K(y \mid x)}{K(x, y)} \approx 1 - \frac{I(x : y)}{K(x, y)}$$

or

$$d(x, y) := \frac{\max\{K(x \mid y), K(y \mid x)\}}{\max\{K(x), K(y)\}}.$$

The intuitive meaning of $d_s(x, y)$ is obvious from its direct relation to mutual information, and
$1 - d(x, y)$ measures the fraction of the information of the more complex string that is shared with
the other one. Bennett et al. [11] propose to construct evolutionary histories of chain letters using
such kinds of information distance measures. However, like in statistical causal inference, inferring
adjacencies on the basis of strongest dependences is only possible for simple causal structures like
trees. In the general case, non-adjacent nodes can share more information than adjacent ones when
information is propagated via more than one path. Instead of constructing causal neighborhood
relations by comparing information distances we will therefore use conditional mutual information.

In order to define its algorithmic version, we first observe that Definition 2 can be rewritten
into the less concise form

$$I(x : y) \stackrel{+}{=} K(y) - K(y \mid x, K(x)).$$

This formula generalizes more naturally to the conditional analog [20]:

Definition 3 (conditional algorithmic mutual information)

Let $x, y, z$ be three strings. Then the conditional algorithmic mutual information of $x, y$, given $z$ is

$$I(x : y \mid z) := K(y \mid z) - K(y \mid x, K(x \mid z), z).$$
As shown in [20] (Remark II.3), the conditional mutual information is also symmetric up to a
constant term:

Lemma 4 (symmetric algorithmic conditional mutual information)

For three strings $x, y, z$ one has:

$$I(x : y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid z) - K(x, y \mid z).$$

Definition 4 (algorithmic conditional independence)

Given three strings $x, y, z$, we call $x$ conditionally independent of $y$, given $z$ (denoted by
$x \perp\!\!\!\perp y \mid z$) if

$$I(x : y \mid z) \approx 0.$$

In words: Given $z$, the additional knowledge of $y$ does not allow us a stronger compression of $x$.
This remains true if we are given the Kolmogorov complexity of $y$, given $z$.

The theory developed below will describe laws where symbols like $x, y, z$ represent arbitrary
strings. Then one can always think of sequences of strings of increasing complexity and statements
like "the equation holds up to constant terms" are well-defined. We will then understand conditional
independence in the sense of $I(x : y \mid z) \stackrel{+}{=} 0$. However, if we are talking about three fixed strings
that represent objects in real life, this does not make sense and the threshold for considering two
strings dependent will heavily depend on the context. For this reason, we will not specify the
symbol $\approx$ any further. This is the same arbitrariness as the cutoff rate for statistical dependence
tests.

The definitions and lemmas presented so far were strongly motivated by the statistical analog.
Now we want to focus on a theorem in [21] that provides a mathematical relationship between
algorithmic and statistical mutual information. First we rephrase Theorem 7.3.1 of [18], showing
that the Kolmogorov complexity of a random string is approximately given
by the entropy of the underlying probability distribution:

Theorem 1 (entropy and Kolmogorov complexity)

Let $\mathbf{x} = x_1, x_2, \dots, x_n$ be a string whose symbols $x_j \in \mathcal{A}$ are drawn i.i.d. from a probability
distribution $P(X)$ over the finite alphabet $\mathcal{A}$. Slightly overloading notation, set
$P(\mathbf{x}) := P(x_1) \cdots P(x_n)$. Let $H(\cdot)$ denote the Shannon entropy of a probability distribution.
Then there is a constant $c$ such that

$$H(P(X)) \leq \frac{1}{n} E(K(\mathbf{x} \mid n)) \leq H(P(X)) + \frac{|\mathcal{A}| \log n}{n} + \frac{c}{n} \quad \forall n,$$

where $E(\cdot)$ is shorthand for the expected value with respect to $P(\mathbf{x})$. Hence

$$\lim_{n \to \infty} \frac{1}{n} E(K(\mathbf{x})) = H(P(X)).$$

However, for our purpose, we need to see the relation between algorithmic and statistical mutual
information. If $\mathbf{x} = x_1, x_2, \dots, x_n$ and $\mathbf{y} = y_1, y_2, \dots, y_n$ such that each pair $(x_j, y_j)$ is drawn
i.i.d. from a joint distribution $P(X, Y)$, the theorem already shows that

$$\lim_{n \to \infty} \frac{1}{n} E(I(\mathbf{x} : \mathbf{y})) = I(X; Y).$$

This can be seen by writing statistical mutual information as

$$H(P(X)) + H(P(Y)) - H(P(X, Y)).$$
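
The entropy decomposition of statistical mutual information used in this argument is easy to evaluate for finite alphabets. The following Python sketch is a plain illustration; the function names and the nested-list format `joint[x][y]` for the probability mass function are assumptions of this example.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(P(X)) + H(P(Y)) - H(P(X,Y)) for a joint pmf joint[x][y]."""
    p_x = [sum(row) for row in joint]            # marginal of X
    p_y = [sum(col) for col in zip(*joint)]      # marginal of Y
    p_xy = [p for row in joint for p in row]     # flattened joint
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)
```

A product distribution such as `[[0.25, 0.25], [0.25, 0.25]]` gives zero mutual information, while two perfectly correlated fair bits, `[[0.5, 0.0], [0.0, 0.5]]`, share one full bit.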
The above translations between entropy and algorithmic information refer to a particular setting
and to special limits. The focus of this paper is mainly the situation where the above limits are not
justified. Before we rephrase Theorem 5.3 in [21], which provides insights into the general case, we
recall that a function $f$ is called recursive if there is a program on a Turing machine that computes
$f(x)$ from the input $x$, and halts on all possible inputs.

Theorem 2 (statistical and algorithmic mutual information)

Given string-valued random variables $X, Y$ with a recursive probability mass function $P(x, y)$ over
pairs $(x, y)$ of strings, we have

$$I(X; Y) - K(P) \stackrel{+}{\leq} E(I(x : y)) \stackrel{+}{\leq} I(X; Y) + 2K(P),$$

where $K(P)$ is the length of the shortest prefix-free program that computes $P(x, y)$ from $(x, y)$.

We want to provide an intuition about various aspects of this theorem. 

(1) If I(X; Y) is large compared to K(P) the expected algorithmic mutual information is dominated 
by the statistical mutual information. 

(2) If K(P) is no longer assumed to be small, statistical dependences do not necessarily ensure that 
the knowledge of x allows us to compress y further than without knowing x. It could be that the 
description of the statistical dependences requires more memory space than its knowledge would 
save. 

(3) On the other hand, knowledge of $x$ could allow us to compress $y$ even in the case of a product
measure on $x$ and $y$. Consider, for instance, the case that we have the point mass distribution on
the pair $(x, y)$ with $x = y$. To describe a more sophisticated example generalizing this case we first
have to introduce a family of product probability distributions on $\{0, 1\}^n$ that we will need several
times throughout the paper.

Definition 5 (Defining product distributions by strings)

Let $P_0, P_1$ be two probability distributions on $\{0, 1\}$ and $c$ be a binary string of length $n$. Then

$$P_c := P_{c_1} \otimes P_{c_2} \otimes \cdots \otimes P_{c_n}$$

defines a distribution on $\{0, 1\}^n$. We will later also need the following generalization: If
$P_{00}, P_{01}, P_{10}, P_{11}$ are four distributions on $\{0, 1\}$, then

$$P_{c,d} := P_{c_1,d_1} \otimes P_{c_2,d_2} \otimes \cdots \otimes P_{c_n,d_n}$$

defines also a family of product measures on $\{0, 1\}^n$ that is labeled by two strings.

Denote by $P_c^{\otimes m}$ the m-fold copy of $P_c$ from Definition 5. It describes a distribution on $\{0,1\}^{nm}$ 
assigning the probability $P_c^{\otimes m}(x)$ to $x \in \{0,1\}^{nm}$. If 

$$Q(x,y) := P_c^{\otimes m}(x)\, P_c^{\otimes m}(y) ,$$

knowledge of x in the typical case provides knowledge of c, provided m is large enough. Then we 
can compress y better than without knowing x because we do not have to describe c any more. 
Hence the algorithmic mutual information is large and the statistical mutual information is zero 
because Q is by construction a product distribution. In other words, algorithmic dependences in a 
setting with i.i.d. sampling can arise from statistical dependences and from algorithmic dependences 
between probability distributions. 
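
The claim that x typically reveals c can be illustrated numerically. The following Python sketch draws x and y independently as m i.i.d. copies from $P_c$ and recovers c from each of them by a per-position majority vote. The concrete choices of $P_0$, $P_1$, n, m and all function names are illustrative assumptions and are not taken from the text.

```python
import random

# Illustrative assumption: P_0 and P_1 are Bernoulli distributions on {0, 1},
# each given by its probability of emitting a 1.
P = {0: 0.1, 1: 0.9}

def sample_block(c, rng):
    """Draw one string of length n from the product distribution P_c."""
    return [1 if rng.random() < P[bit] else 0 for bit in c]

def sample_copies(c, m, rng):
    """Draw m i.i.d. strings from P_c, i.e. one sample from the m-fold copy."""
    return [sample_block(c, rng) for _ in range(m)]

def estimate_label(copies):
    """Recover c by a per-position majority vote over the m copies."""
    m = len(copies)
    n = len(copies[0])
    return [1 if 2 * sum(block[j] for block in copies) > m else 0 for j in range(n)]

rng = random.Random(0)
n, m = 32, 201
c = [rng.randint(0, 1) for _ in range(n)]
x = sample_copies(c, m, rng)  # first observation
y = sample_copies(c, m, rng)  # second observation, drawn independently of x

c_from_x = estimate_label(x)
c_from_y = estimate_label(y)
```

Because x alone already determines c in this setting, a description of y given x need not spend K(c) bits on c, although Q is a product distribution and the statistical mutual information vanishes.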
2.2 Markov condition for algorithmic dependences among individual objects 

Now we state the causal Markov condition for individual objects as a postulate that links algorithmic 
mutual dependences with causal structure: 

Postulate 5 (algorithmic causal Markov condition) 

Let $x_1, \dots, x_n$ be n strings representing descriptions of observations whose causal connections are 
formalized by a directed acyclic graph G with $x_1, \dots, x_n$ as nodes. Let $pa_j$ be the concatenation of 
all parents of $x_j$ and $nd_j$ the concatenation of all its non-descendants except $x_j$ itself. Then 

$$x_j \perp\!\!\!\perp nd_j \,|\, pa_j^* .$$

As in the definition of algorithmic conditional independence, the appropriate cut-off rate for rejecting G 
when $I(x_j : nd_j | pa_j^*) > 0$ will not be specified here. 

This formulation is a natural interpretation of Postulate 2 in terms of algorithmic independences. 
The only point that remains to be justified is why we condition on $pa_j^*$ instead of $pa_j$, i.e., why we 
are given the optimal joint compression of the parent strings. The main reason is that this turns out 
to yield nice statements on the equivalence of different Markov conditions (in analogy to Lemma 1). 
Since the differences between $I(x_j : nd_j | pa_j)$ and $I(x_j : nd_j | pa_j^*)$ can only be logarithmic in the 
string length$^1$, we will not focus on this issue any further. 

If we apply Postulate [5] to a trivial graph consisting of two unconnected nodes, we obtain the 
following statement. 

Lemma 5 (causal principle for algorithmic information) 

If the mutual information I(x : y) between two objects x, y is significantly greater than zero they 
have some kind of common past. 

Here, common past between two objects means that one has causally influenced the other or 
there is a third one influencing both. The statistical version of this principle is part of Reichenbach's 
principle of the common cause [21] stating that statistical dependences between random variables$^2$ 
X and Y are always due to at least one of the following three types of causal links: (1) X is a cause 
of Y, or (2) vice versa, or (3) there is a common cause Z. For objects, the term "common past" 
includes all three types of causal relations. For a text, for instance, it reads: similarities of two texts 
x, y indicate that one author has been influenced by the other or that both have been influenced 
by a third one. 

Before we construct a model of causality that makes it possible to prove the causal Markov 
condition we want to discuss some examples. If one discovers significant similarities in the genome 
of two sorts of animals one will try to explain the similarities by relatedness in the sense of evolution. 
Usually, one would, for instance, assume such a common history if one has identified long substrings 
that both animals have in common. However, the following scenario shows two observations that 
superficially look similar, but nevertheless we cannot infer a common past since their algorithmic 
complexity is low (implying that the algorithmic mutual information is low, too). 

1 This is because $K(x|y) - K(x|y^*) = O(\log |y|)$, see [19]. 

2 The original formulation actually considers dependences between events, i.e., binary variables. 

Assume two persons are instructed to write down a binary string of length 1000 and both decide 
to write the same string x = 1100100100001111110... It seems straightforward to assume that the 
persons have communicated and agreed upon this choice. However, after observing that x is just 
the binary representation of π, one can easily imagine that it was just a coincidence that both 
wrote the same sequence. In other words, the similarities are no longer significant after observing 
that they stem from a simple rule. This shows that the length of the pattern that is common to 
both observations is not a reasonable criterion for whether the similarities are significant. 
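
The distinction between the length of a shared pattern and its complexity can be made tangible with a general-purpose compressor. The sketch below is only a crude, computable stand-in: it replaces K by the zlib-compressed length, and it uses a repetitive string instead of the digits of π because zlib would not recognise the latter as simple. The function names are hypothetical.

```python
import random
import zlib

def c_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude stand-in for K)."""
    return len(zlib.compress(data, 9))

def proxy_mutual_information(x: bytes, y: bytes) -> int:
    """Compression-based analog of I(x : y) = K(x) + K(y) - K(x, y)."""
    return c_len(x) + c_len(y) - c_len(x + y)

rng = random.Random(0)
complex_string = bytes(rng.getrandbits(8) for _ in range(1000))  # hard to compress
simple_string = b"01" * 500                                      # produced by a short rule

# Both pairs consist of two identical strings of length 1000.
shared_complex = proxy_mutual_information(complex_string, complex_string)
shared_simple = proxy_mutual_information(simple_string, simple_string)
```

In both pairs the two strings agree on all 1000 symbols, yet only the pair of complex strings yields a large value of the proxy; for the simple string the agreement carries almost no information.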

To understand the algorithmic causal Markov condition we will study its implications as well 
as its justification. In analogy to Lemma 1 we have 

Theorem 3 (equivalence of algorithmic Markov conditions) 

Given the strings $x_1, \dots, x_n$ and a directed acyclic graph G, the following conditions are 
equivalent: 

I. Recursive form: the joint complexity is given by the sum of complexities of each node, given 
the optimal compression of its parents: 

$$K(x_1, \dots, x_n) \stackrel{+}{=} \sum_{j=1}^n K(x_j | pa_j^*) \qquad (6)$$

II. Local Markov condition: Every node is independent of its non-descendants, given the 
optimal compression of its parents: 

$$I(x_j : nd_j | pa_j^*) \stackrel{+}{=} 0 .$$

III. Global Markov condition: 

$$I(S : T | R^*) \stackrel{+}{=} 0$$

if R d-separates S and T. 

Below we will therefore no longer distinguish between the different versions and just refer to "the 
algorithmic Markov condition". The intuitive meaning of eq. (6) is that the shortest description 
of all strings is given by describing how to generate every string from its direct causes. A similar 
kind of "modularity" of descriptions will also occur later in a different context when we consider 
description complexity of joint probability distributions. 
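
As a small worked instance (added here for illustration, not part of the original argument), consider the chain $x_1 \to x_2 \to x_3$, for which $pa_1$ is empty, $pa_2 = x_1$ and $pa_3 = x_2$. Under these assumptions the three conditions of Theorem 3 read:

```latex
\begin{align*}
\text{I.}\quad   & K(x_1, x_2, x_3) \stackrel{+}{=} K(x_1) + K(x_2 \,|\, x_1^*) + K(x_3 \,|\, x_2^*) \\
\text{II.}\quad  & I\big(x_3 : (x_1, x_2) \,\big|\, x_2^*\big) \stackrel{+}{=} 0
                   \quad \text{(the only non-trivial local condition)} \\
\text{III.}\quad & I(x_1 : x_3 \,|\, x_2^*) \stackrel{+}{=} 0
                   \quad \text{since } x_2 \text{ d-separates } x_1 \text{ and } x_3 .
\end{align*}
```

Condition I expresses that the shortest joint description first describes $x_1$, then how to obtain $x_2$ from $x_1$, and finally how to obtain $x_3$ from $x_2$.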

For the proof of Theorem 3 we will need a lemma that is an analog of the observation that for 
any two random variables X, Y the statistical mutual information satisfies $I(f(X);Y) \leq I(X;Y)$ 
for every measurable function f. The algorithmic analog is to consider two strings x, y and one 
string z that is derived from $x^*$ by a simple rule. 

Lemma 6 (monotonicity of algorithmic information) 

Let x, y, z be three strings such that $K(z|x^*) \stackrel{+}{=} 0$. Then 

$$I(z : y) \stackrel{+}{\leq} I(x : y) .$$

This lemma is a special case of Theorem II.7 in [50]. We will also need the following result: 

Lemma 7 (monotonicity of conditional information) 

Let x, y, z be three strings. Then 

$$K(z|x^*) \stackrel{+}{\geq} K(z|(x,y)^*) .$$

Note that $K(z|x^*) \stackrel{+}{\geq} K(z|x^*,y)$ and $K(z|x^*) \stackrel{+}{\geq} K(z|x^*,y^*)$ are obvious, but Lemma 7 is 
non-trivial because the star operation is jointly applied to x and y. 

Proof of Lemma 7: Clearly the string x can be derived from x, y by a program of length O(1). 
Lemma 6 therefore implies 

$$I(z : x) \stackrel{+}{\leq} I(z : x, y) ,$$

where I(z : x, y) is shorthand for I(z : (x, y)). Hence 

$$K(z) - K(z|x^*) \stackrel{+}{=} I(z : x) \stackrel{+}{\leq} I(z : x, y) \stackrel{+}{=} K(z) - K(z|(x,y)^*) .$$

Then we obtain the statement by subtracting K(z) and inverting the sign. □ 

The following lemma will only be used in Subsection 3.3. We state it here because it is closely 
related to the ones above. 

Lemma 8 (generalized data processing inequality) 

For any three strings x, y, z, 

$$I(x : y | z^*) \stackrel{+}{=} 0$$

implies 

$$I(x : y) \stackrel{+}{\leq} I(x : z) .$$

The name "data processing inequality" is justified because the assumption $x \perp\!\!\!\perp y \,|\, z^*$ may arise 
from the typical data processing scenario where y is obtained from x via z. 

Proof of Lemma 8: Using Lemma 7 we have 

$$\begin{aligned}
K(x|y^*) &\stackrel{+}{\geq} K(x|(z,y)^*) \qquad (7) \\
&\stackrel{+}{=} K(x|z,y,K(y,z)) \\
&\stackrel{+}{=} K(x|z,y,K(z)+K(y|z^*)) \\
&\stackrel{+}{\geq} K(x|z,y,K(z),K(y|z^*)) \\
&\stackrel{+}{=} K(x|z^*,y,K(y|z^*)) ,
\end{aligned}$$

where the second inequality holds because $K(z) + K(y|z^*)$ can obviously be computed from the 
pair $(K(z), K(y|z^*))$ by an O(1) program. The last equality uses, again, the equivalence of $z^*$ and 
$(z, K(z))$. Hence we obtain: 

$$\begin{aligned}
I(x:y) &\stackrel{+}{=} K(x) - K(x|y^*) \stackrel{+}{=} K(x|z^*) + I(x:z) - K(x|y^*) \\
&\stackrel{+}{\leq} K(x|z^*) + I(x:z) - K(x|y,K(y|z^*),z^*) \\
&\stackrel{+}{=} I(x:z) + I(x:y|z^*) \stackrel{+}{=} I(x:z) .
\end{aligned}$$

The first step is by the definition of I(x : y), the second one uses Lemma 7, the third step is a direct application 
of ineq. (7), the fourth one is due to Definition 3, and the last step is by assumption. □ 

Proof of Theorem 3: I ⇒ III: Define a probability mass function P on $(\{0,1\}^*)^{\times n}$, i.e., the set of 
n-tuples of strings, as follows. Set 

$$P(x_j | pa_j) := \frac{1}{z_j}\, 2^{-K(x_j | pa_j^*)} , \qquad (8)$$

where $z_j$ is a normalization factor. In this context, it is important that the symbol $pa_j$ refers 
to conditioning on the k-tuple of strings $x_i$ that are parents of $x_j$ (in contrast to conditional 
complexities where we can interpret $K(.|pa_j^*)$ equally well as conditioning on one string given by 
the concatenation of all those $x_i$). Note that Kraft's inequality (see [19], Example 3.3.1) implies 

$$\sum_x 2^{-K(x|y)} \leq 1$$

for every y, entailing that the expression is indeed normalizable by $z_j \leq 1$. We have 

$$K(x_j | pa_j^*) \stackrel{+}{=} -\log_2 P(x_j | pa_j) .$$

Then we set 

$$P(x_1, \dots, x_n) := \prod_{j=1}^n P(x_j | pa_j) , \qquad (9)$$

i.e., P is by construction recursive with respect to G. It is easy to see that $K(x_1, \dots, x_n)$ can be 
computed from P: 

$$\begin{aligned}
K(x_1, \dots, x_n) &\stackrel{+}{=} \sum_{j=1}^n K(x_j | pa_j^*) \qquad (10) \\
&\stackrel{+}{=} -\sum_{j=1}^n \log_2 P(x_j | pa_j) \\
&= -\log_2 P(x_1, \dots, x_n) .
\end{aligned}$$

Remarkably, we can also compute Kolmogorov complexities of subsets of $\{x_1, \dots, x_n\}$ from the 
corresponding marginal probabilities. We start by proving 

$$K(x_1, \dots, x_{n-1}) \stackrel{+}{=} -\log_2 \sum_{x_n} 2^{-K(x_1, \dots, x_n)} . \qquad (11)$$

To this end, we observe 

$$\sum_{x_n} 2^{-K(x_1, \dots, x_n)} \stackrel{\times}{=} \sum_{x_n} 2^{-K(x_1, \dots, x_{n-1}) - K(x_n | (x_1, \dots, x_{n-1})^*)} \stackrel{\times}{\leq} 2^{-K(x_1, \dots, x_{n-1})} , \qquad (12)$$

where $\stackrel{\times}{=}$ denotes equality up to a multiplicative constant. The equality follows from eq. (4) 
and the inequality is obtained by applying Kraft's inequality, as stated above, to the conditional complexity 
$K(.|(x_1, \dots, x_{n-1})^*)$. On the other hand we have 

$$K(x_1, \dots, x_{n-1}) \stackrel{+}{=} K(x_1, \dots, x_{n-1}, 0) ,$$

since adding the 1-bit string $x_n = 0$ certainly can be performed by a program of length O(1). Hence 
we have 

$$\sum_{x_n} 2^{-K(x_1, \dots, x_n)} \geq 2^{-K(x_1, \dots, x_{n-1}, 0)} \stackrel{\times}{=} 2^{-K(x_1, \dots, x_{n-1})} .$$

Combining this with ineq. (12) yields 

$$2^{-K(x_1, \dots, x_{n-1})} \stackrel{\times}{=} \sum_{x_n} 2^{-K(x_1, \dots, x_n)} .$$

Using eq. (10) we obtain 

$$K(x_1, \dots, x_{n-1}) \stackrel{+}{=} -\log_2 \sum_{x_n} 2^{-K(x_1, \dots, x_n)} \stackrel{+}{=} -\log_2 \sum_{x_n} P(x_1, \dots, x_n) = -\log_2 P(x_1, \dots, x_{n-1}) ,$$

which proves equation (11). This implies 

$$K(x_1, \dots, x_{n-1}) \stackrel{+}{=} -\log_2 P(x_1, \dots, x_{n-1}) .$$

Since the same argument holds for marginalizing over any other variable $x_j$ we conclude that 

$$K(x_{j_1}, \dots, x_{j_k}) \stackrel{+}{=} -\log_2 P(x_{j_1}, \dots, x_{j_k}) \qquad (13)$$

for every subset of strings of size k with $k \leq n$. This follows by induction over n − k. 

Now we can use the relation between marginal probabilities and Kolmogorov complexities to 
show that conditional complexities are also given by the corresponding conditional probabilities, 
i.e., for any two subsets $S, T \subset \{x_1, \dots, x_n\}$ we have 

$$K(S|T^*) \stackrel{+}{=} -\log_2 P(S|T) .$$

Without loss of generality, set $S := \{x_1, \dots, x_j\}$ and $T := \{x_{j+1}, \dots, x_k\}$ for $j < k \leq n$. Using 
eqs. (4) and (13) we get 

$$\begin{aligned}
K(x_1, \dots, x_j | (x_{j+1}, \dots, x_k)^*) &\stackrel{+}{=} K(x_1, \dots, x_k) - K(x_{j+1}, \dots, x_k) \\
&\stackrel{+}{=} -\log_2 \big( P(x_1, \dots, x_k) / P(x_{j+1}, \dots, x_k) \big) \\
&= -\log_2 P(x_1, \dots, x_j | x_{j+1}, \dots, x_k) .
\end{aligned}$$

Let S, T, R be three subsets of $\{x_1, \dots, x_n\}$ such that R d-separates S and T. Then $S \perp\!\!\!\perp T \,|\, R$ with 
respect to P because P satisfies the recursion (9) (see Lemma 1)$^3$. Hence 

$$\begin{aligned}
K(S,T|R^*) &\stackrel{+}{=} -\log_2 P(S,T|R) \\
&\stackrel{+}{=} -\log_2 P(S|R) - \log_2 P(T|R) \\
&\stackrel{+}{=} K(S|R^*) + K(T|R^*) .
\end{aligned}$$

This proves algorithmic independence of S and T, given $R^*$, and thus I ⇒ III. 

To show that III ⇒ II it suffices to recall that $nd_j$ and $x_j$ are d-separated by $pa_j$. Now we show 
II ⇒ I in strong analogy to the proof for the statistical version of this statement in [3]: Consider 
first a terminal node of G. Assume, without loss of generality, that it is $x_n$. Hence all strings 
$x_1, \dots, x_{n-1}$ are non-descendants of $x_n$. We thus have $(nd_n, pa_n) \equiv (x_1, \dots, x_{n-1})$ where $\equiv$ means 
that both strings coincide up to a permutation (on one side) and removing those strings that occur 
twice (on the other side). Due to eq. (4) we have 

$$K(x_1, \dots, x_n) \stackrel{+}{=} K(x_1, \dots, x_{n-1}) + K(x_n | (nd_n, pa_n)^*) . \qquad (14)$$

Using, again, the equivalence $w^* \equiv (w, K(w))$ for any string w we have 

$$\begin{aligned}
K(x_n | (nd_n, pa_n)^*) &\stackrel{+}{=} K(x_n | nd_n, pa_n, K(nd_n, pa_n)) \\
&\stackrel{+}{=} K(x_n | nd_n, pa_n, K(pa_n) + K(nd_n | pa_n^*)) \\
&\stackrel{+}{\geq} K(x_n | nd_n, pa_n^*, K(nd_n | pa_n^*)) \\
&\stackrel{+}{=} K(x_n | pa_n^*) . \qquad (15)
\end{aligned}$$

The second step follows from $K(nd_n, pa_n) \stackrel{+}{=} K(pa_n) + K(nd_n | pa_n^*)$. The inequality holds because 
$nd_n, pa_n, K(pa_n) + K(nd_n | pa_n^*)$ can be computed from $nd_n, pa_n^*, K(nd_n | pa_n^*)$ via a program of length 
O(1). The last step follows directly from the assumption $x_n \perp\!\!\!\perp nd_n \,|\, pa_n^*$. Combining ineq. (15) 
with Lemma 7 yields 

$$K(x_n | (nd_n, pa_n)^*) \stackrel{+}{=} K(x_n | pa_n^*) . \qquad (16)$$

Combining eqs. (16) and (14) we obtain 

$$K(x_1, \dots, x_n) \stackrel{+}{=} K(x_1, \dots, x_{n-1}) + K(x_n | pa_n^*) . \qquad (17)$$

Then statement I follows by induction over n. □ 

3 Since P is, by construction, a discrete probability function, the density with respect to a product measure is 
directly given by the probability function itself. 

To show that the algorithmic Markov condition can be derived from an algorithmic version of the 
functional model in Postulate [4] we introduce the following model of causal mechanisms. 

Postulate 6 (algorithmic model of causality) 

Let G be a DAG formalizing the causal structure among the strings $x_1, \dots, x_n$. Then every $x_j$ is 
computed by a program $q_j$ with length O(1) from its parents $pa_j$ and an additional input $n_j$. We 
write formally 

$$x_j = q_j(pa_j, n_j) ,$$

meaning that the Turing machine computes $x_j$ from the input $pa_j, n_j$ using the additional program 
$q_j$ and halts. The inputs $n_j$ are jointly independent in the sense 

$$n_j \perp\!\!\!\perp n_1, \dots, n_{j-1}, n_{j+1}, \dots, n_n .$$

By defining new programs that contain $n_j$ we can, equivalently, drop the assumption that the programs 
$q_j$ are simple and assume that they are jointly independent instead. 

We could also have assumed that $x_j$ is a function $f_j$ of all its parents, but our model is more 
general since the map defined by the input-output behavior of $q_j$ need not be a total function [15], 
i.e., the Turing machine simulating the process would not necessarily halt on all inputs $pa_j, n_j$. 
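
The structure of this model can be sketched in a few lines of Python. The example below is only a toy instance under stated assumptions: the chain x1 → x2 → x3, the three short programs q1, q2, q3 and the use of pseudo-random bytes as stand-ins for the independent inputs are all hypothetical choices made for illustration.

```python
import random

def q1(parents, noise):
    """Root mechanism: x1 is just its noise input."""
    return noise

def q2(parents, noise):
    """x2 is the bitwise XOR of its single parent with its noise input."""
    (p,) = parents
    return bytes(a ^ b for a, b in zip(p, noise))

def q3(parents, noise):
    """x3 is the reversed parent string followed by the noise input."""
    (p,) = parents
    return p[::-1] + noise

# Hypothetical DAG x1 -> x2 -> x3, listed in topological order:
# node name -> (names of its parents, program q_j).
MODEL = {
    "x1": ((), q1),
    "x2": (("x1",), q2),
    "x3": (("x2",), q3),
}

def generate(model, noises):
    """Compute every x_j = q_j(pa_j, n_j), visiting nodes in topological order."""
    values = {}
    for name, (parent_names, program) in model.items():
        parents = tuple(values[p] for p in parent_names)
        values[name] = program(parents, noises[name])
    return values

rng = random.Random(1)
# Separate draws stand in for the jointly independent inputs n_j.
noises = {name: bytes(rng.getrandbits(8) for _ in range(16)) for name in MODEL}
strings = generate(MODEL, noises)
```

Each program is short and fixed, so all the information that distinguishes one run from another enters through the inputs, which mirrors the role of the $n_j$ in the postulate.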

The idea to represent causal mechanisms by programs written for some universal Turing machine 
is basically in the spirit of various interpretations of the Church-Turing thesis. One formulation, 
given by Deutsch [25], states that every process taking place in the real world can be simulated by 
a Turing machine. Here we assume that the way different systems influence each other by physical 
signals can be simulated by computation processes that exchange messages of bit strings.$^4$ 

Note that mathematics also allows us to construct strings that are linked to each other in an 
uncomputable way. For instance, let x be an arbitrary binary string and y be defined by y := K(x). 
However, it is hard to believe that a real causal mechanism could create such relations 
between objects given that one believes that real processes can always be simulated by algorithms. 
These remarks are intended to give sufficient motivation for our model. 

Postulate 6 implies the algorithmic causal Markov condition: 

Theorem 4 (algorithmic model implies Markov) 

Let $x_1, \dots, x_n$ be generated by the model in Postulate 6. Then they satisfy the algorithmic Markov 
condition with respect to G. 

Proof (straightforward adaptation of the proof of Lemma 2): Extend G to a causal structure $\tilde{G}$ 
with nodes $x_1, \dots, x_n, n_1, \dots, n_n$. To see that the extended set of nodes satisfies the local Markov 
condition w.r.t. $\tilde{G}$, observe first that every node $x_j$ is given by its parents via an O(1) program. 
Second, every $n_j$ is parentless and (unconditionally) independent of all its non-descendants because 
they can be computed from $\{n_1, \dots, n_n\} \setminus \{n_j\}$ via an O(1) program. 

By Theorem 3 the extended set of nodes is also globally Markovian w.r.t. $\tilde{G}$. The parents $pa_j$ 
d-separate $x_j$ and $nd_j$ in $\tilde{G}$ (here the parents $pa_j$ are still defined with respect to G). This implies 
the local Markov condition for G. □ 
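
The d-separation step of this proof can be checked mechanically on a small example. The sketch below assumes a hypothetical diamond-shaped graph G, adds one parentless noise node per string as in the proof, and tests d-separation with the standard moralized-ancestral-graph criterion. All names are illustrative.

```python
from itertools import combinations

def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def d_separated(parents, xs, ys, zs):
    """True if zs d-separates xs from ys in the DAG given as node -> parent list."""
    keep = ancestral_set(parents, set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: link nodes to parents, marry co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        for p in parents[v]:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(parents[v], 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete the conditioning set and test undirected reachability.
    blocked, seen = set(zs), set()
    stack = [x for x in xs if x not in blocked]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(w for w in adj[v] if w not in blocked)
    return not (seen & set(ys))

def non_descendants(parents, v):
    """All nodes other than v that are not descendants of v."""
    desc, stack = set(), [v]
    while stack:
        u = stack.pop()
        for child, ps in parents.items():
            if u in ps and child not in desc:
                desc.add(child)
                stack.append(child)
    return set(parents) - desc - {v}

def extend_with_noise(parents):
    """Add a parentless noise node n_<name> as an extra parent of every node."""
    ext = {v: list(ps) + ["n_" + v] for v, ps in parents.items()}
    ext.update({"n_" + v: [] for v in parents})
    return ext

# Hypothetical DAG G: x1 -> x2, x1 -> x3, x2 -> x4, x3 -> x4.
G = {"x1": [], "x2": ["x1"], "x3": ["x1"], "x4": ["x2", "x3"]}
G_ext = extend_with_noise(G)

# For every x_j: do its parents in G d-separate it from its other
# non-descendants (taken with respect to G) inside the extended graph?
local_markov = {
    v: d_separated(G_ext, {v}, non_descendants(G, v) - set(ps), set(ps))
    for v, ps in G.items()
}
```

For this graph every entry of `local_markov` is true, whereas x2 and x3 are not d-separated by the empty set because of their common parent x1.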

It is trivial to construct examples where the causal Markov condition is violated if the programs $q_j$ 
are mutually dependent (for instance, the trivial graph with two nodes $x_1, x_2$ and no edge would 
satisfy $I(x_1 : x_2) > 0$ if the programs $q_1, q_2$ computing $x_1, x_2$ from an empty input are dependent). 

The last sentence of Postulate 6 makes apparent that the mechanisms that generate causal 
relations are assumed to be independent. This is essential for the general philosophy of this paper. 
To see that such a mutual independence of mechanisms is a reasonable assumption we recall that 
the causal graph is meant to formalize all relevant causal links between the objects. If we observe, 
for instance, that two nodes are generated from their parents by the same complex rule, we postulate 
another causal link between the nodes that explains the similarity of mechanisms.$^5$ 

2.3 Relative causality 

This subsection explains why it is sensible to define algorithmic dependence and the existence or 
non-existence of causal links relative to some background information. To this end, we consider 
genetic sequences $s_1, s_2$ of two persons that are not relatives. We certainly find high similarity that 
leads to a significant violation of $I(s_1 : s_2) = 0$ due to the fact that both genes are taken from 
humans. However, given the background information "$s_1$ is a human genetic sequence", $s_1$ can be 
further compressed. The same applies to $s_2$. Let h be a code that is particularly adapted to the 
human genome in the sense that the expected conditional Kolmogorov complexity, given h, of a 
randomly chosen human genome is minimal. Then it would make sense to consider $I(s_1 : s_2 | h) > 0$ 
as a hint for a relation that goes beyond the fact that both persons are human. In contrast, for 
the unconditional mutual information we expect $I(s_1 : s_2) > K(h)$. We will therefore infer some 
causal relation (here: common ancestors in the evolution) using the causal principle in Lemma 5 
(cf. [28]). 

4 Note, however, that sending quantum systems between the nodes could transmit a kind of information ("quantum 
information" [26]) that cannot be phrased in terms of bits. It is known that this enables completely new communication 
scenarios, e.g., quantum cryptography. The relevance of quantum information transfer for causal inference is not yet fully 
understood. It has, for instance, been shown that the violation of Bell's inequality in quantum theory is also relevant for 
causal inference [27]. This is because some causal inference rules between classical variables break down when the latent 
factors are represented by quantum states rather than being classical variables. 

5 One could argue that this would be just the causal principle implying that similarities of the "machines" generating 
$x_j$ from $pa_j$ have to be explained by a causal relation, i.e., a common past of the machines. However, in the context 
of this paper, such an argument would be circular. We have argued that the causal principle is a special case of the 
Markov condition and derived the latter from the algorithmic model above. We will therefore consider the independence 
of mechanisms as a first principle. 

The common properties between different and unrelated individuals of the same species can 
be screened off by providing the relevant background information. Given this causal background, 
we can detect further similarities in the genes by the conditional algorithmic mutual information 
and take them as an indicator for an additional causal relation that goes beyond the common 
evolutionary background. For this reason, every discussion on whether there exists a causal link 
between two objects (or individuals) requires a specification of the background information. In this 
sense, causality is a relative concept. 
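
The screening-off effect of background information can be imitated with a compressor that accepts a preset dictionary. In the sketch below, the compressed length under zlib again serves as a rough stand-in for Kolmogorov complexity, the byte string h plays the role of the shared background, and s1, s2 are hypothetical sequences that share h but are otherwise unrelated. None of this is taken from the text, and a real analysis of genetic sequences would need a far better model.

```python
import random
import zlib

def c_len(data: bytes, background: bytes = b"") -> int:
    """Compressed length of data, optionally given background as a preset dictionary."""
    if background:
        comp = zlib.compressobj(level=9, zdict=background)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())

def proxy_mi(x: bytes, y: bytes, background: bytes = b"") -> int:
    """Compression-based analog of I(x : y | background)."""
    return c_len(x, background) + c_len(y, background) - c_len(x + y, background)

def random_bytes(rng, n):
    return bytes(rng.getrandbits(8) for _ in range(n))

rng = random.Random(2)
h = random_bytes(rng, 2000)        # shared background ("species-level" part)
s1 = h + random_bytes(rng, 500)    # individual 1: background plus own variation
s2 = h + random_bytes(rng, 500)    # individual 2: background plus own variation

unconditional = proxy_mi(s1, s2)   # dominated by the shared background
conditional = proxy_mi(s1, s2, h)  # background screened off
```

Without the background the proxy reports a large dependence, of the order of the length of h. Once h is supplied, the remaining value is small, so only similarities beyond the common background would show up.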

One may ask whether such a relativity of causality is also true for the statistical version of the 
causality principle, i.e., Reichenbach's principle of the common cause. In the statistical version 
of the link between causality and dependence, the relevance of the background information is less 
obvious because it is evident that statistical methods are always applied to a given statistical 
ensemble. If we, for instance, ask whether there is a causal relation between the height and the 
income of a person without specifying whether we refer to people of a certain age, we observe the 
same relativity with respect to additionally specifying the "background information" , which is here 
given by referring to a specific ensemble. 

In the following sections we will assume that the relevant background information has been 
specified and it has been clarified how to translate the relevant aspects of a real object into a 
binary string such that we can identify every object with its binary description. 

3 Novel statistical inference rules from the algorithmic Markov 
condition 

3.1 Algorithmic independence of Markov kernels 

To describe the implications of the algorithmic Markov condition for statistical causal inference, 
we consider random variables X and Y where X causally influences Y . We can think of P(X) as 
describing a source S that generates x-values and sends them to a "machine" M that generates 
y-values according to P(Y\X). Assume we observe that 

I(P(X) : P(Y\X)) > 0. 

Then we conclude that there must be a causal link between S and M that goes beyond transferring 
x-values from S to M. This is because P(X) and P(Y|X) are inherent properties of S and M, 
respectively, which do not depend on the current value of x that has been sent. Hence there must 
be a causal link that explains the similarities in the design of S and M. Here we have assumed 
that we know that X → Y is the correct causal structure on the statistical level. Then we have to 
accept that a causal link on the level of the machine design is present. 

If the causal structure on the statistical level is unknown, we would prefer causal hypotheses 
that explain the data without needing a causal connection on the higher level provided that they 
satisfy the statistical Markov condition. Given this principle, we thus will prefer causal graphs G 
for which the Markov kernels $P(X_j|PA_j)$ become algorithmically independent. This is equivalent 
to saying that the shortest description of $P(X_1, \dots, X_n)$ is given by concatenating the descriptions 
of the Markov kernels, a postulate that has already been formulated by Lemeire and Dirkx [29]: 

Postulate 7 (algorithmic independence of statistical properties) 

A causal hypothesis G (i.e., a DAG) is only acceptable if the shortest description of the joint 
density P is given by a concatenation of the shortest descriptions of the Markov kernels, i.e., 

$$K(P(X_1, \dots, X_n)) \stackrel{+}{=} \sum_j K(P(X_j|PA_j)) . \qquad (18)$$

If no such causal graph exists, we reject every possible DAG and assume that there is a causal 
relation of a different type, e.g., a latent common cause, selection bias, or a cyclic causal structure. 

The sum on the right hand side of eq. (18) will be called the total complexity of the causal 
model G. Note that Postulate 7 implies that we have to reject every causal hypothesis for which 
the total complexity is not minimal because a model with shorter total complexity already provides 
a shorter description of the joint distribution. Inferring causal directions by minimizing this 
expression (or actually a computable modification) could also be interpreted in a Bayesian way if 
we consider $K(P(X_j|PA_j))$ as the negative log likelihood for the prior probability for having the 
conditional $P(X_j|PA_j)$ (after appropriate normalization). However, Postulate 7 contains an idea 
that goes beyond known Bayesian approaches to causal discovery because it provides hints on the 
incompleteness of the class of models under consideration (in addition to providing rules for giving 
preference within the class). 

Lemeire and Dirkx [29] already show that the causal faithfulness principle (Postulate 3) follows 
from Postulate 7. Now we want to show that it also implies causal inference rules that go beyond 
the known ones. 

To this end, we focus again on the example in Subsection 1.2 with a binary variable X and a 
continuous variable Y. The hypothesis X → Y is not rejected on the basis of Postulate 7 because 
$I(P(X) : P(Y|X)) \stackrel{+}{=} 0$. For the equally weighted mixture of two Gaussians this already follows$^6$ 
from $K(P(X)) \stackrel{+}{=} 0$. On the other hand, Y → X violates Postulate 7. Elementary calculations 
show that the conditional P(X|Y) is given by the sigmoid function 

$$P(X = 1 | y) = \frac{1}{2}\left(1 + \tanh\frac{\lambda (y - \mu)}{\sigma^2}\right) . \qquad (19)$$

We observe that the same parameters σ, λ, μ that occur in P(Y) also occur in P(X|Y). This 
already shows that the two Markov kernels are algorithmically dependent. To be more explicit, we 
observe that μ, λ, and σ are required to specify P(Y). To describe P(X|Y), we need $\lambda/\sigma^2$ and μ. 
Hence we have 

$$\begin{aligned}
K(P(Y)) &\stackrel{+}{=} K(\mu, \lambda, \sigma) \stackrel{+}{=} K(\mu) + K(\lambda) + K(\sigma) \\
K(P(X|Y)) &\stackrel{+}{=} K(\mu, \lambda/\sigma^2) \stackrel{+}{=} K(\mu) + K(\lambda/\sigma^2) \\
K(P(X,Y)) &\stackrel{+}{=} K(P(Y), P(X|Y)) \stackrel{+}{=} K(\mu, \lambda, \sigma) \stackrel{+}{=} K(\mu) + K(\lambda) + K(\sigma) ,
\end{aligned}$$

where we have assumed that the strings μ, λ, σ are jointly independent. Note that the information 
that P(Y) is a mixture of two Gaussians and that P(X|Y) is a sigmoid counts as a constant because 
its description complexity does not depend on the parameters. 
We thus get 

$$I(P(Y) : P(X|Y)) \stackrel{+}{=} K(\mu) + K(\lambda/\sigma^2) .$$

Therefore we reject the causal hypothesis Y → X due to Postulate 7. The interesting point is that 
we need not look at the alternative hypothesis X → Y. In other words, we do not reject Y → X 
only because the converse direction leads to simpler expressions. We can reject it solely on the 
basis of observing algorithmic dependences between P(Y) and P(X|Y), which make the causal model 
suspicious. 

6 For the more general case P(X = 1) = p with $K(p) \gg 0$, this also follows if we assume that p is algorithmically 
independent of the parameters that specify P(Y|X). 

Figure 3: Left: a source generates the bimodal distribution P(Y). A machine generates x-values according to 
a conditional P(X|Y) given by a sigmoid function. If the slope and the position parameters of the sigmoid are 
not correctly adjusted to the distance, the position, and the width of the two Gaussian modes, the generated 
joint distribution no longer consists of two Gaussians (right). [Panel labels: P(y); P(-1,y), P(1,y).] 

The following gedankenexperiment shows that Y → X would become plausible if we "detune" 
the sigmoid P(X|Y) by choosing $\tilde\lambda, \tilde\mu, \tilde\sigma$ independently of λ, μ, and σ. Then P(Y) and P(X|Y) 
are by definition algorithmically independent and therefore we obtain a more complex joint 
distribution: 

$$K(P(X,Y)) \stackrel{+}{=} K(\lambda) + K(\tilde\lambda) + K(\mu) + K(\tilde\mu) + K(\sigma) + K(\tilde\sigma) .$$

The fact that the set of mixtures of two Gaussians does not have six free parameters already shows 
that P(X, Y) must be a more complex distribution than the one above. Fig. 3 shows an example 
of a joint distribution obtained for the "detuned" situation. 
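
The link between the parameters of P(Y) and those of the sigmoid can be verified numerically. The sketch below assumes a concrete parametrization that is not spelled out in this section: X takes the values −1 and +1 with equal probability, and Y given X = x is Gaussian with mean μ + λx and standard deviation σ. Under that assumption, Bayes' rule yields a tanh-shaped conditional with slope λ/σ² and position μ, while a sigmoid with independently chosen parameters does not match. The numerical values and function names are arbitrary.

```python
import math

def gaussian_pdf(y, mean, sigma):
    return math.exp(-((y - mean) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_x_plus(y, mu, lam, sigma):
    """P(X = +1 | y) by Bayes' rule for the assumed two-Gaussian mixture."""
    plus = 0.5 * gaussian_pdf(y, mu + lam, sigma)
    minus = 0.5 * gaussian_pdf(y, mu - lam, sigma)
    return plus / (plus + minus)

def sigmoid_form(y, mu, lam, sigma):
    """Tanh-shaped conditional with slope lam / sigma**2 and position mu."""
    return 0.5 * (1.0 + math.tanh(lam * (y - mu) / sigma**2))

mu, lam, sigma = 0.5, 1.2, 0.8           # parameters of P(Y), arbitrary values
ys = [mu + 0.25 * k for k in range(-12, 13)]

# "Tuned" sigmoid: same parameters as in P(Y).
tuned_gap = max(abs(posterior_x_plus(y, mu, lam, sigma) - sigmoid_form(y, mu, lam, sigma))
                for y in ys)

# "Detuned" sigmoid: parameters chosen independently of those in P(Y).
detuned_gap = max(abs(posterior_x_plus(y, mu, lam, sigma) - sigmoid_form(y, 0.0, 0.3, 0.8))
                  for y in ys)
```

The tuned sigmoid agrees with the Bayes posterior up to rounding error, which is the numerical counterpart of the statement that P(X|Y) reuses the parameters of P(Y).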

As already noted by [29], the independence of mechanisms is related to Pearl's thoughts on the 
stability of causal statements: the causal mechanism P(Xj\PAj) does not change if one changes the 
input distribution P(PAj) by influencing the variables PAj. The same conditional can therefore 
occur, under different background conditions, with different input distributions. 

Postulate 7 naturally occurs in the probability-free version of the causal Markov condition. 
To explain this, assume we are given two strings x and y of length n (describing two real-world 
observations) and noticed that x = y. Now we consider two alternative scenarios: 

(I) Assume that every pair $(x_j, y_j)$ of digits (j = 1, ..., n) has been independently drawn from the 
same joint distribution P(X, Y) of the binary random variables X and Y. 

(II) Let x and y be single instances of string-valued random variables X and Y. 

The difference between (I) and (II) is crucial for statistical causal inference: In case (I), statistical 
independence is rejected with high confidence, proving the existence of a causal link. In contrast, 
there is no evidence for statistical dependence in case (II) since the underlying joint distribution 
on $\{0,1\}^n \times \{0,1\}^n$ could, for instance, be the point mass on the pair (x, y), which is a product 
distribution, i.e., 

$$P(X,Y) = P(X)P(Y) .$$

Hence, statistical causal inference would not infer a causal connection in case (II). 

Algorithmic causal inference, on the other hand, infers a causal link in both cases because the 
equality x = y requires an explanation. The relevance of switching between (I) and (II) then 
consists merely in shifting the causal connection to another level: In the i.i.d. setting, every $x_j$ must 
be causally linked to $y_j$. In case (II), there must be a connection between the two mechanisms that 
have generated the entire strings because I(P(X) : P(Y|X)) = I(P(X) : P(Y)) > 0. This can, 
for instance, be due to the fact that two machines emitting the same string were designed by the 
same engineer. A detailed discussion of the relevance of translating the i.i.d. assumption into the 
setting of algorithmic causal inference will be given in Subsection 3.2. 
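
The contrast between the two readings of the same data can be made concrete with a plug-in estimate of the statistical mutual information. The following sketch uses an arbitrary balanced binary string as a hypothetical observation; the estimator and its name are illustrative.

```python
import math
from collections import Counter

def empirical_mutual_information(xs, ys):
    """Plug-in estimate (in bits) of I(X;Y) from the paired samples (xs[j], ys[j])."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (k / n) * math.log2((k / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), k in joint.items()
    )

x = "0110100110010110" * 4  # an arbitrary balanced binary string
y = x                       # the observation x = y

# Reading (I): the n digit pairs are n samples of binary variables X, Y.
mi_digits = empirical_mutual_information(x, y)

# Reading (II): the two strings form one single sample of string-valued variables.
mi_strings = empirical_mutual_information([x], [y])
```

Under reading (I) the estimate equals one bit per digit pair, whereas under reading (II) a single sample carries no evidence of statistical dependence and the estimate is zero, in line with the discussion above.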

Examples with large probability spaces 

In the preceding subsection we have ignored a serious problem with defining the Kolmogorov 
complexity of (conditional) probability distributions that even occurs in finite probability spaces. 
First of all the "true" probabilities may not be computable. For instance, a coin may produce 
"head" with probability p where p is some uncomputable number, i.e., K(p) = oo. And even if it 
were some computable value p with large K(p) it would be quite artificial to call the probability 
distribution (p, 1— p) "complex" because K(p) is high and "simple" if we have, for instance p = 
A more reasonable notion of complexity can be obtained by describing the probabilities only up to 
a certain accuracy ε. If ε is not too small we obtain small complexity values for the distribution of 
a binary variable, and also low complexity for a distribution on a larger set that is ε-close to the 
values of some simple analytical expression like a Gaussian distribution. There will still remain 
some unease about the concept of Kolmogorov complexity of "the true distribution" . We will 
subsequently develop a formalism that avoids this concept. However, Kolmogorov complexity of 
distributions is a useful idea to start with since it provides an intuitive understanding of the roots 
of the asymmetries between cause and effect that we will describe in Subsection 3.2. 

Below, we will describe a gedankenexperiment with two random variables X, Y linked by the 
causal structure X → Y where the total complexities of the causal models X → Y and Y → X 
both are well-defined and, in the generic case, different. First we will show that they can differ 
at most by a factor of two. 

Lemma 9 (maximal complexity quotient) 

For every joint distribution P(X, Y) we have 

$$K(P(Y)) + K(P(X|Y)) \stackrel{+}{\leq} 2\big(K(P(X)) + K(P(Y|X))\big) .$$

Proof: Since marginals and conditionals both can be computed from P(X, Y) we have 

$$K(P(Y)) + K(P(X|Y)) \stackrel{+}{\leq} 2K(P(X,Y)) .$$

Then the statement follows because P(X, Y) can be computed from P(X) and P(Y|X). □ 

To construct examples where the bound in Lemma [9] is attained we first introduce a method to 
construct conditionals with well-defined complexity: 

Definition 6 (Conditionals and joint distributions from strings) 

Let $M_0, M_1$ be two stochastic matrices that specify transition probabilities from {0,1} to {0,1}. 
Then 

$$M_c := M_{c_1} \otimes M_{c_2} \otimes \cdots \otimes M_{c_n}$$

defines transition probabilities from $\{0,1\}^n$ to $\{0,1\}^n$. 

We also introduce the same construction for double indices: Let $M_{00}, M_{01}, M_{10}, M_{11}$ be stochastic 
matrices describing transition probabilities from {0,1} to {0,1}. Let $c, d \in \{0,1\}^n$ be two strings. 
Then 

$$M_{c,d} := M_{c_1,d_1} \otimes M_{c_2,d_2} \otimes \cdots \otimes M_{c_n,d_n}$$

defines a transition matrix from $\{0,1\}^n$ to $\{0,1\}^n$. If the matrices $M_j$ or $M_{ij}$ denote joint 
distributions on {0,1} × {0,1}, the objects $M_c$ and $M_{c,d}$ define joint distributions on $\{0,1\}^n \times \{0,1\}^n$ in 
a canonical way. 

Let $X, Y$ be variables whose values are strings in $\{0,1\}^n$. Define distributions $P_0, P_1$ on
$\{0,1\}$ and stochastic matrices $A_0, A_1$ describing transition probabilities from $\{0,1\}$ to
$\{0,1\}$. Then a string $c \in \{0,1\}^n$ defines a distribution $P(X) := P_c$ (using Definition 5)
that has well-defined Kolmogorov complexity $K(c)$ if the description complexity of $P_0$ and $P_1$
is neglected. Furthermore, we set $P(Y|X) := A_d$ as in Definition 6, where we have used the canonical
identification between stochastic matrices and conditional probabilities and $d \in \{0,1\}^n$ denotes
some randomly chosen string. Let $R_{ij}$ denote the joint distribution on $\{0,1\} \times \{0,1\}$
induced by the marginal $P_i$ on the first component and the conditional $A_j$ for the second, given
the first. Denote the corresponding marginal distribution on the right component by $Q_{ij}$, i.e.,

$$Q_{ij} := A_j P_i ,$$

and let $B_{ij}$ be the stochastic matrix that describes the conditional probability for the first
component, given the second.

Using these notations and the ones in Definition 5 we obtain

$$\begin{array}{lcl}
P(X) &=& P_c \\
P(Y|X) &=& A_d \\
P(X,Y) &=& R_{c,d} \\
P(Y) &=& Q_{c,d} \\
P(X|Y) &=& B_{c,d}
\end{array} \qquad (20)$$


It is noteworthy that $P(Y)$ and $P(X|Y)$ are labeled by both strings while $P(X)$ and $P(Y|X)$
are described by only one string each. This already suggests that the former are more complex in
the generic case.

Now we compare the sum $K(P(X)) + K(P(Y|X))$ to $K(P(Y)) + K(P(X|Y))$ for the case
$K(c) = K(d) = n$. We assume that $P_i$ and $A_j$ are computable and their complexity is counted as
$O(1)$ because it does not depend on $n$. Nevertheless, we assume that $P_i$ and $A_j$ are "generic"
in the following sense: All marginals $Q_{ij}$ and conditionals $B_{ij}$ are different whenever
$P_0 \neq P_1$ and $A_0 \neq A_1$. If we impose one of the conditions $P_0 = P_1$ and $A_0 = A_1$, or
both, we assume that only those marginals $Q_{ij}$ and conditionals $B_{ij}$ coincide for which the
equality follows from the conditions imposed. Consider the following cases:

Case 1: $P_0 = P_1$, $A_0 = A_1$. Then all the complexities vanish because the joint distribution does
not depend on the strings $c$ and $d$.

Case 2: $P_0 \neq P_1$, $A_0 = A_1$. Then the digits of $c$ are relevant, but the digits of $d$ are not.
Those marginals and conditionals in table (20) that formally depend on $c$ and $d$, as well as those
that depend on $c$, have complexity $n$. Those depending on $d$ have complexity 0.

$$K(P(X)) + K(P(Y|X)) = n + 0 = n$$
$$K(P(Y)) + K(P(X|Y)) = n + n = 2n .$$

Case 3: $P_0 = P_1$, $A_0 \neq A_1$. Only the dependence on $d$ contributes to the complexity. This
implies

$$K(P(X)) + K(P(Y|X)) = 0 + n = n$$
$$K(P(Y)) + K(P(X|Y)) = n + n = 2n .$$

Case 4: $P_0 \neq P_1$ and $A_0 \neq A_1$. Every formal dependence of the conditionals and marginals on
$c$ and $d$ in table (20) is a proper dependence. Hence we obtain

$$K(P(X)) + K(P(Y|X)) = n + n = 2n$$
$$K(P(Y)) + K(P(X|Y)) = 2n + 2n = 4n .$$

The general principle of the above example is very simple. Suppose that $P(X)$ is taken from a
model class that consists of $N$ different elements and $P(Y|X)$ is taken from a class with $M$
different elements. Then the class of possible $P(Y)$ and the class of possible $P(X|Y)$ can both
contain $N \cdot M$ elements. If the simplicity of a model is quantified in terms of the size of the
class it is taken from (within a hierarchy of more and more complex models), the statement that $P(Y)$
and $P(X|Y)$ are typically complex is just based on this simple counting argument.
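
The counting argument can be checked numerically for small n. The following Python sketch is only an illustration under stated assumptions: the concrete vectors `P0`, `P1` and matrices `A0`, `A1` are arbitrary "generic" choices that do not appear in the text, stochastic matrices are stored column-wise (column x holds the distribution of Y given X = x), and counting distinct distributions stands in for comparing complexities.

```python
import itertools
from functools import reduce

import numpy as np


def tensor(factors):
    """Kronecker product of a list of vectors or matrices."""
    return reduce(np.kron, factors)


def count_distinct(P, A, n):
    """Count distinct P(X) = P_c and distinct P(Y) = A_d P_c over all c, d in {0,1}^n."""
    strings = list(itertools.product((0, 1), repeat=n))
    marginals_x, marginals_y = set(), set()
    for c in strings:
        p_c = tensor([P[bit] for bit in c])
        marginals_x.add(tuple(np.round(p_c, 12)))
        for d in strings:
            a_d = tensor([A[bit] for bit in d])
            marginals_y.add(tuple(np.round(a_d @ p_c, 12)))
    return len(marginals_x), len(marginals_y)


# Hypothetical generic choices (not taken from the text).
P0, P1 = np.array([0.9, 0.1]), np.array([0.6, 0.4])
A0 = np.array([[0.8, 0.3], [0.2, 0.7]])
A1 = np.array([[0.5, 0.1], [0.5, 0.9]])

print(count_distinct([P0, P1], [A0, A1], n=2))  # setting of Case 4
print(count_distinct([P0, P0], [A0, A1], n=2))  # setting of Case 3
```

In the setting of Case 4 the class of possible P(Y) has as many elements as there are pairs (c, d), whereas P(X) only ranges over the strings c; setting `P0 = P1` removes the dependence on c, as in Case 3.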

Detecting common causes via dependent Markov kernels 

The following model shows that latent common causes can yield joint distributions whose Kolmogorov
complexity is smaller than $K(P(X)) + K(P(Y|X))$ and $K(P(Y)) + K(P(X|Y))$. Let $X, Y, Z$
have values in $\{0,1\}^n$ and let $P(Z) := \delta_c$ be the point mass on some random string
$c \in \{0,1\}^n$. Let $P(X|Z)$ and $P(Y|Z)$ both be given by the stochastic matrix
$A \otimes A \otimes \cdots \otimes A$. Let $P_0 \neq P_1$ be the probability vectors given by the
columns of $A$. Then

$$P(X) = P(Y) = P_c ,$$

with $P_c$ as in Definition 5. Since $P(Z)$ is supported by the singleton set $\{c\}$, we have
$P(X|Y) = P(X)$ and $P(Y|X) = P(Y)$. Thus

$$K(P(X)) + K(P(Y|X)) = K(P(X|Y)) + K(P(Y)) = K(P(X)) + K(P(Y)) = 2n .$$

On the other hand, we have

$$K(P(X|Z)) + K(P(Y|Z)) + K(P(Z)) = 0 + 0 + n = n .$$

By observing that there is a third variable $Z$ such that

$$K(P(X|Z)) + K(P(Y|Z)) + K(P(Z)) \stackrel{+}{=} K(P(X,Y)) ,$$

we thus have obtained a hint that the latent model is the more appropriate causal hypothesis.

Analysis of the required sample size 

The following arguments show that the above complexities of the Markov kernels become relevant
already for moderate sample size. Readers who are not interested in technical details may skip the
remaining part of the subsection.

Consider first the sampling required to estimate $c$ by drawing i.i.d. from $P_c$ as in Definition 5.
By counting the number of symbols 1 that occur at position $j$ we can guess whether $c_j$ is 0 or 1 by
choosing the distribution for which the relative frequency is closer to the corresponding probability.
To bound the error probabilities from above set

$$\mu := |P_0(1) - P_1(1)| .$$

Then the probability $q$ that the relative frequency deviates by more than $\mu/2$ decreases
exponentially in the number of copies, i.e., $q \le e^{-\mu m \alpha}$ where $\alpha$ is an appropriate
constant. The probability to have no error for any digit is then bounded from below by
$(1 - e^{-\mu m \alpha})^n$. We want to increase $m$ such that the error probability tends to zero. To
this end, choose $m$ such that $e^{-\mu m \alpha} \le 1/n^2$, i.e., $m \ge \ln n^2 / (\mu \alpha)$. Hence

$$(1 - e^{-\mu m \alpha})^n \ge \left(1 - \frac{1}{n^2}\right)^n \to 1 ,$$

because

$$\left(1 - \frac{1}{n^2}\right)^{n^2} \to \frac{1}{e} ,$$

and therefore

$$\lim_{n\to\infty} \left(1 - \frac{1}{n^2}\right)^n
= \lim_{n\to\infty} \left[\left(1 - \frac{1}{n^2}\right)^{n^2}\right]^{1/n}
= \lim_{n\to\infty} e^{-1/n} = 1 .$$

The required sample size thus grows only logarithmically in $n$. In the same way, one shows that
the sample size needed to distinguish between different conditionals $P(Y|X) = A_c$ increases only
with the logarithm of $n$ provided that $P(X)$ is a strictly positive product distribution on
$\{0,1\}^n$.
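
The estimation procedure and the sample-size bound can be illustrated with a short simulation. This is a minimal sketch, not part of the original argument: the values chosen for $P_0(1)$, $P_1(1)$ and for the constant $\alpha$ are hypothetical, and the function names are introduced here for illustration only.

```python
import math

import numpy as np


def sample_size(n, mu, alpha):
    """Smallest integer m with exp(-mu * m * alpha) <= 1 / n**2."""
    return math.ceil(math.log(n**2) / (mu * alpha))


def success_lower_bound(n):
    """Lower bound (1 - 1/n^2)^n on the probability that all n digits are correct."""
    return (1.0 - 1.0 / n**2) ** n


def estimate_string(samples, p0_one, p1_one):
    """Guess each digit c_j from the relative frequency of 1s at position j."""
    freq = samples.mean(axis=0)
    return (np.abs(freq - p1_one) < np.abs(freq - p0_one)).astype(int)


rng = np.random.default_rng(0)
p0_one, p1_one = 0.1, 0.9            # hypothetical values of P_0(1) and P_1(1)
c = rng.integers(0, 2, size=64)      # the string to be recovered
prob_one = np.where(c == 1, p1_one, p0_one)
m = 200                              # number of i.i.d. draws from P_c
samples = (rng.random((m, c.size)) < prob_one).astype(int)
c_hat = estimate_string(samples, p0_one, p1_one)

print((c_hat == c).all())
print(sample_size(c.size, mu=abs(p0_one - p1_one), alpha=1.0))
```

Doubling n only adds a constant to `sample_size`, which is the logarithmic growth stated above.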



3.2 Resolving statistical ensembles into individual observations 

The assumption of independent identically distributed random variables is one of the cornerstones
of standard statistical reasoning. In this section we show that the independence assumption in a
typical statistical sample is often due to prior knowledge on causal relations among single objects
which can be nicely represented by a DAG. We will see that the algorithmic causal Markov condition
then leads to non-trivial implications.

Assume we describe a biased coin toss, repeated $m$ times, and obtain the binary string
$x_1, \dots, x_m$ as result. This is certainly one of the scenarios where the i.i.d. assumption is well
justified because we do not believe that the coin changes or that the result of one coin toss
influences the other ones. The only relation between the coin tosses is that they refer to the same
coin. We will thus draw a DAG representing the relevant causal relations for the scenario where $C$
(the coin) is the common cause of all $x_j$ (see fig. 4).

Given the relevant information on $C$ (i.e., given the probability $p$ for "head"), we have
conditional algorithmic independence between the $x_j$ when applying the Markov condition to this
causal graph (see footnote 7). However, there are two problems: (1) it does not make sense to consider algorithmic mutual
information among binary strings of length 1. (2) Our theory developed so far (Theorems [3] and Ql 
considered the number of strings (which is $m + 1$ here) as constant and thus even the complexity
of $x_1, \dots, x_m$ is considered as $O(1)$. To solve this problem, we define a new structure with
three nodes as follows. For some arbitrary $k < m$ set $x^1 := x_1, \dots, x_k$ and
$x^2 := x_{k+1}, \dots, x_m$. Then $C$ is the common cause of $x^1$ and $x^2$ and
$I(x^1 : x^2 | C) \stackrel{+}{=} 0$ because every similarity between $x^1$ and $x^2$ is due to their
common source (note that the information that the strings $x^j$ have been obtained by combining $k$
and $m - k$ results, respectively, is here implicitly considered as background information in the
sense of relative causality in Subsection 2.3). We will later discuss examples where a source
generates symbols from a larger probability space. Then every $x_j$ is a string and it is important to
keep in mind the "format information", i.e., the information how to read the concatenation
$x_1, x_2, \dots, x_k$ as a sample of m strings. This format information will always be considered as
background, too.

7 This is consistent with the following Bayesian interpretation: if we define a non-trivial prior on the possible values
of p, the individual observations are statistically dependent when marginalizing over the prior, but knowing p renders
them independent.

Figure 4: Causal structure of the coin toss. The statistical properties of the coin C define the common cause
that links the results of the coin toss.

Of course, we may also consider partitions into more than two substrings keeping in mind that
their number is considered as $O(1)$. When we consider causal relations between short strings we will
thus always apply the algorithmic causal Markov condition to groups of strings rather than applying
it to the "small objects" themselves. The DAG that formalizes the causal relations between instances
or groups of instances of a statistical ensemble and the source that determines the statistics in the
above sense will be called the "resolution of statistical ensembles into individual observations".

The resolution gets more interesting if we consider causal relations between two random variables
$X$ and $Y$. Consider the following scenario where $X$ is the cause of $Y$. Let $S$ be a source
generating x-values $x_1, \dots, x_m$ according to a fixed probability distribution $P(X)$. Let $M$ be a
machine that receives these values as inputs and generates y-values $y_1, \dots, y_m$ according to the
conditional $P(Y|X)$. Fig. 5 (left) shows the causal graph for $m = 4$.

In analogy to the procedure above, we divide the string $x := x_1, \dots, x_m$ into
$x^1 := x_1, \dots, x_k$ and $x^2 := x_{k+1}, \dots, x_m$ and use the same grouping for the y-values. We
then draw the causal graph in fig. 5 (right) showing causal relations between
$x^1, x^2, y^1, y^2, S, M$. Now we assume that $P(X)$ and $P(Y|X)$ are not known, i.e., we don't have
access to the relevant properties of $S$ and $M$. Thus we have to consider $S$ and $M$ as "hidden
objects" (in analogy to hidden variables in the statistical setting). Therefore we have to apply the
Markov condition to the causal structure in such a way that only the observed objects
$x^1, x^2, y^1, y^2$ occur. One checks easily that $x^2$ d-separates $x^1$ and $y^2$ and $x^1$
d-separates $x^2$ and $y^1$. Exhaustive search over all possible triples of subsets of
$x^1, x^2, y^1, y^2$ shows that these are the only non-trivial d-separation conditions. We conclude

$$I(x^1 : y^2 | x^2) \stackrel{+}{=} 0 \quad \text{and} \quad I(x^2 : y^1 | x^1) \stackrel{+}{=} 0 . \qquad (21)$$
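
The two d-separation statements, and the failure of their time-reversed counterpart, can be verified mechanically. The sketch below assumes that the graph in fig. 5 (right) has the edges S → x¹, S → x², x¹ → y¹, x² → y², M → y¹ and M → y² (this edge list is read off from the description in the text, since the figure itself is not reproduced here). It uses the standard moralization criterion for d-separation; the helper names are illustrative.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, xs, ys, zs):
    """Check whether zs d-separates xs from ys (moralized ancestral graph test)."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    adjacent = {node: set() for node in keep}
    for node in keep:
        node_parents = list(parents.get(node, ()))
        for parent in node_parents:                 # undirected version of each edge
            adjacent[node].add(parent)
            adjacent[parent].add(node)
        for a, b in combinations(node_parents, 2):  # "marry" parents of a common child
            adjacent[a].add(b)
            adjacent[b].add(a)
    blocked = set(zs)
    frontier = [x for x in xs if x not in blocked]
    reached = set(frontier)
    while frontier:
        node = frontier.pop()
        for neighbour in adjacent[node]:
            if neighbour not in blocked and neighbour not in reached:
                reached.add(neighbour)
                frontier.append(neighbour)
    return not (reached & set(ys))


# Assumed edge list of fig. 5 (right), given as child -> list of parents.
graph = {"S": [], "M": [], "x1": ["S"], "x2": ["S"],
         "y1": ["x1", "M"], "y2": ["x2", "M"]}

print(d_separated(graph, ["x1"], ["y2"], ["x2"]))  # first condition in eq. (21)
print(d_separated(graph, ["x2"], ["y1"], ["x1"]))  # second condition in eq. (21)
print(d_separated(graph, ["y1"], ["x2"], ["y2"]))  # roles of X and Y exchanged
```

Conditioning on y² opens the collider x² → y² ← M, which is why the exchanged statement is not entailed by the graph.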

The most remarkable property of eq. (21) is that it is asymmetric with respect to exchanging the
roles of $X$ and $Y$ since, for instance, $I(y^1 : x^2 | y^2) \stackrel{+}{=} 0$ can be violated.
Intuitively, the reason is that given $y^2$, the knowledge of $x^2$ provides better insights into the
properties of $S$ and $M$ than knowledge of $x^1$ would do, which can be an advantage when describing
$y^1$. The following example shows that this asymmetry can even be relevant for sample size $m = 2$
provided that the probability space is large.

Figure 5: Left: causal structure obtained by resolving the causal structure $X \to Y$ between the random
variables $X$ and $Y$ into causal relations among single events. Right: causal graph obtained by combining the
first $k$ observations to $x^1$ and the remaining $m - k$ to $x^2$ and the same for $Y$. We observe that $x^2$
d-separates $x^1$ and $y^2$, while $y^2$ does not d-separate $y^1$ and $x^2$. This asymmetry distinguishes
causes from effects.

Figure 6: Visualization of the truncation process: The source $S$ always generates the same string, the machine
truncates either the left or the right end. Given only the four strings $x_1, x_2$ and $y_1, y_2$ as observations,
we can reject the causal hypothesis $Y \to X$. This is because $I(x_1 : y_2 | y_1)$ can be significantly greater
than zero provided that the substrings missing in $y_1, y_2$ at the left or at the right end, respectively, are
sufficiently complex.

Let $S$ be a source that always generates the same string $a \in \{0,1\}^n$. Assume furthermore that
$a$ is algorithmically random in the sense that $K(a) \stackrel{+}{=} n$. For sample size $m = 2$ we
then have $x = (x_1, x_2) = (a, a)$. Let $M$ be a machine that removes $\ell$ digits from its input
string of length $n$, chosen at random either at the beginning or at the end. By this procedure we
obtain a string $y_j \in \{0,1\}^{\tilde{n}}$ with $\tilde{n} := n - \ell$ from $x_j$.

For sample size 2 it is likely that $y_1$ and $y_2$ contain the last $n - \ell$ and the first
$n - \ell$ digits of $a$, respectively, or vice versa. This process is depicted in fig. 6 for $n = 8$
and $\ell = 2$. Since the sample size is only two, the partition of the sample into two halves leads to
single observations, i.e., $x^j = x_j$ and $y^j = y_j$ for $j = 1, 2$.

In short-hand notation, $y^1 = a_{[1 \dots n-\ell]}$ and $y^2 = a_{[\ell+1 \dots n]}$. We then have

$$I(x^1 : y^2 | x^2) \stackrel{+}{=} 0 \quad \text{and} \quad I(y^1 : x^2 | x^1) \stackrel{+}{=} 0 ,$$

but

$$I(y^1 : x^2 | y^2) \stackrel{+}{=} \ell \quad \text{and} \quad I(x^1 : y^2 | y^1) \stackrel{+}{=} \ell ,$$

which correctly lets us prefer the causal direction $X \to Y$ because these dependences violate the
global algorithmic Markov condition in Theorem 3 when applied to a hypothetical graph where $y^1$
and $y^2$ are the outputs of the source and $x^1$ and $x^2$ are the outputs of a machine that has
received $y^1$ and $y^2$.

Even though the condition in eq. (21) does not explicitly contain the notion of complexities
of Markov kernels it is closely related to the algorithmic independence of Markov kernels. To
explain this, assume we would generate algorithmic dependences between $S$ and $M$ by adding an
arrow $S \to M$ or $S \leftarrow M$ or by adding a common cause. Then $x^2$ would no longer d-separate
$x^1$ from $y^2$. The possible violation of eq. (21) could then be an observable result of the
algorithmic dependences between the hidden objects $S$ and $M$ (and their statistical properties
$P(X)$ and $P(Y|X)$, respectively).

3.3 Conditional density estimation on subsamples 

Now we develop an inference rule that is even closer to the idea of checking algorithmic dependences
of Markov kernels than condition (21), but still avoids the notion of Kolmogorov complexity of the
"true" conditional distributions by using finite sample estimates instead. Before we explain the idea
we mention two simpler approaches for doing so and describe their potential problems. It would be
straightforward to apply Postulate 7 to the finite sample estimates of the conditionals. In particular,
minimum description length (MDL) approaches [21] appear promising from the theoretical point of
view due to their close relation to Kolmogorov complexity. We rephrase the minimum complexity
estimator described by Barron and Cover [30]: Given a string-valued random variable $X$ and a
sample $x_1, \dots, x_m$ drawn from $P(X)$, set

$$P_m(X) := \mathrm{argmin}_Q \left\{ K(Q) - \sum_{j=1}^m \log Q(x_j) \right\} ,$$

where $Q$ runs over all probability densities on the probability space under consideration. If the data
is sampled from a computable distribution, then $P_m(X)$ converges in probability to $P(X)$ [30]. Let
us define a similar estimator $P_m(Y|X)$ for the conditional density $P(Y|X)$. Could we reject the
causal hypothesis $X \to Y$ after observing that $P_m(X)$ and $P_m(Y|X)$ are mutually dependent? In
the context of the true probabilities, we have argued that $P(X)$ and $P(Y|X)$ represent independent
mechanisms. However, for the estimators we do not see a justification for independence because
the relative frequencies of the x-values influence the estimation of both $P_m(X)$ and $P_m(Y|X)$. This
counter-argument becomes irrelevant only if the sample size is such that the complexities of the
estimators coincide with the complexities of the true distributions. If we assume that the latter
are typically uncomputable (because generic real numbers are uncomputable) this sample size will
never be attained.
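
Since $K(Q)$ is uncomputable, the estimator above cannot be implemented as stated. The following sketch shows only the shape of the two-part trade-off on a toy model class: it assumes Bernoulli densities with dyadic parameters $k/2^b$ and charges $b$ bits per candidate as a crude, hypothetical stand-in for $K(Q)$. Neither the model class nor the cost function is taken from [30].

```python
import math


def candidates(max_bits):
    """Bernoulli parameters k / 2**b (k odd), each charged b bits in place of K(Q)."""
    for bits in range(1, max_bits + 1):
        for k in range(1, 2**bits, 2):
            yield k / 2**bits, bits


def min_complexity_estimate(sample, max_bits=8):
    """Return (p, bits) minimising bits - sum_j log2 Q_p(x_j) over the candidates."""
    ones = sum(sample)
    zeros = len(sample) - ones

    def total_length(candidate):
        p, bits = candidate
        return bits - ones * math.log2(p) - zeros * math.log2(1.0 - p)

    return min(candidates(max_bits), key=total_length)


print(min_complexity_estimate([1] * 50 + [0] * 50))
print(min_complexity_estimate([1] * 25 + [0] * 75))
```

A parameter with a longer description is selected only if the gain in likelihood pays for the extra bits, which is the balance that the argmin expresses.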

The general idea of MDL [21] also suggests the following causal inference principle: If we are
given the data points $(x_j, y_j)$ with $j = 1, \dots, m$, consider the MDL estimators $P_m(X)$ and
$P_m(Y|X)$. They define a joint distribution that we denote by $P_{X \to Y}(X, Y)$ (where we have
dropped $m$ for convenience). The total description length

$$C_{X \to Y} := K(P_m(X)) + K(P_m(Y|X)) - \sum_{j=1}^m \log P_{X \to Y}(x_j, y_j)$$

measures the complexity of the probabilistic model plus the complexity of the data, given the model.
Then we compare $C_{X \to Y}$ to $C_{Y \to X}$ (defined correspondingly) and prefer the causal direction
with the smaller value. However, it is not clear whether this kind of reasoning can be derived from the
algorithmic Markov condition.

For this reason, we construct an inference rule that uses estimators in a more sophisticated
way and whose justification is directly based on applying the algorithmic Markov condition to the
resolution of ensembles. The idea of our strategy is that we do not use the full data set to estimate
$P(Y|X)$. Instead, we apply the estimator to a subsample of $(x, y)$ pairs that no longer carries
significant information about the relative frequencies of x-values in the full data set. As we will see
below, this leads to algorithmically independent finite sample estimators for the Markov kernels if
the causal hypothesis is correct.

Let $X \to Y$ be the causal structure that generated the data $(x, y)$, with $x := x_1, \dots, x_m$ and
$y := y_1, \dots, y_m$ after m-fold i.i.d. sampling from $P(X, Y)$. The resolution of the ensemble is the
causal graph in fig. 7, left.

Figure 7: Left: Causal structure between single observations $x_1, \dots, x_m, y_1, \dots, y_m$ for sampling
from $P(X, Y)$, given the causal structure $X \to Y$. The programs $p_j$ compute $x_j$ from the description of
the source $S$. The programs $q_j$ compute $y_j$ from $x_j$ and the description of the machine $M$,
respectively. The grey nodes are those that are selected for the subsample (see text). Right: Causal structure
relating $x$, $\tilde{x}_j$, and $\tilde{y}_j$. Note that the causal relation between $\tilde{x}_j$ and
$\tilde{y}_j$ is the same as the one between the corresponding pair $x_j$ and $y_j$. Here, for instance,
$\tilde{x}_3 = x_4$ and $\tilde{y}_3 = y_4$ and it is thus still the same program $q_4$ that computes $y_4$
from $x_4$ and $M$. Hence, the causal model that links $M$ with the selected values $\tilde{x}_j$ and
$\tilde{y}_j$ is the subgraph of the graph showing relations between $x_j$, $y_j$ and $M$. This kind of
robustness of the causal structure with respect to the selection procedure will be used below.

According to Postulate 6 there are mutually independent programs $p_j$ computing $x_j$ from the
description of $S$. Likewise, there are mutually independent programs $q_j$ computing $y_j$ from $M$ and
$x_j$. Assume we are given a rule how to generate a subsample of $x_1, \dots, x_m$ from $x$. It is
important that this selection rule does not refer to $y$ but only uses $x$ (as well as some random
string as additional input) and that the selection can be performed by a program of length $O(1)$.
Denote the subsample by

$$\tilde{x} := \tilde{x}_1, \dots, \tilde{x}_l := x_{j_1}, \dots, x_{j_l} ,$$

with $l < m$. The above selection of indices defines also a subsample of y-values

$$\tilde{y} := y_{j_1}, \dots, y_{j_l} =: \tilde{y}_1, \dots, \tilde{y}_l .$$

By construction, we have 

Hence we can draw the causal structure depicted in fig. 7, right.

Let now $D_X$ be any string that is derived from $x$ by some program of length $O(1)$. $D_X$ may be
the full description of relative frequencies or any computable density estimator $\hat{P}(X)$, or some
other description of interesting properties of the relative frequencies. Similarly, let $D_{YX}$ be a
description that is derived from $\tilde{x}, \tilde{y}$ by some simple algorithmic rule. The idea is
that it is a computable estimator $\hat{P}(Y|X)$ for the conditional distribution $P(Y|X)$ or any
relevant property of the latter. Instead of estimating conditionals, one may also consider an estimator
of the joint density of the subsample. We augment the causal structure in fig. 7, right, with $D_X$ and
$D_{YX}$. The structure can be simplified by merging nodes in the same level and we obtain the
structure in fig. 8.

To derive testable implications of the causal hypothesis, we observe that all information shared
between $D_X$ and $D_{YX}$ is processed via $\tilde{x}$. We thus have

$$D_{YX} \perp D_X \,|\, \tilde{x}^* , \qquad (22)$$

which formally follows from the global Markov condition in Theorem|31 Using Lemma|S]and eq. (|2"2"| 
we conclude 

$$I(D_X : D_{YX}) \stackrel{+}{\le} I(\tilde{x} : D_X) . \qquad (23)$$

The intention behind generating the subsample $\tilde{x}$ is to "blur" the distribution of $X$. If we
have a density estimator $\hat{P}(X)$ we try to choose the subsample such that the algorithmic mutual
information between $\tilde{x}$ and $\hat{P}(X)$ is small. Otherwise we have not sufficiently blurred
the distribution of $X$. Then we apply an arbitrary conditional density estimator $\hat{P}(Y|X)$ to the
subsample. If there still is a non-negligible amount of mutual information between $\hat{P}(X)$ and
$\hat{P}(Y|X)$, the causal hypothesis in fig. 7, left, cannot be true and we reject $X \to Y$.

To show that the above procedure can also be applied to data sampled from uncomputable
probability distributions, let $P_0$ and $P_1$ be uncomputable distributions on $\{0,1\}$ and
$A_0, A_1$ uncomputable stochastic maps from $\{0,1\}$ to $\{0,1\}$. Define a string-valued random
variable $X$ with distribution $P(X) := P_c$ as in Definition 5 and the conditional distribution of a
string-valued variable $Y$ by $P(Y|X) := A_d$ as in Definition 6 for strings $c, d \in \{0,1\}^n$. Let
$P_0$ and $P_1$ as well as $A_0$ and $A_1$ be known up to an accuracy that is sufficient to distinguish
between them. We assume that all this information (including $n$) is given as background knowledge,
but $c$ and $d$ are unknown. Let $D_X := c'$, where $c'$ is the estimated value of $c$ computed from the
finite sample $x$ of size $m$. Likewise, let $D_{YX} := d'$ be the estimated value of $d$ derived from the
subsample $(\tilde{x}, \tilde{y})$ of size $\tilde{m}$. If $m$ is large enough (such that also
$\tilde{m}$ is sufficiently large) we can estimate $c$ and $d$, i.e., $c' = c$ and $d' = d$ with high
probability. The most radical method to blur $P(X)$ is to choose $\tilde{x}$ such that the empirical
distribution of x-values is uniform and the $\tilde{x}_j$-values are lexicographically reordered (with
some random ordering among the j-values that correspond to the same x-value). The only algorithmic
information that $\tilde{x}$ then contains is the description of its length, i.e., $\log_2 \tilde{m}$
bits. Hence we have

$$I(D_X : \tilde{x}) \stackrel{+}{\le} \log_2 \tilde{m} .$$

Figure 8: $D_X$ is some information derived from $x$. The idea is that it is a density estimator for $P(X)$ or
that it describes properties of the empirical distribution of x-values. If the selection procedure
$x \to \tilde{x}$ has sufficiently blurred this information, the mutual information between $\tilde{x}$ and
$D_X$ is low. $D_{YX}$, on the other hand, is a density estimator for $P(Y|X)$ or it encodes some desired
properties of the empirical joint distribution of x- and y-values in the subsample. If the mutual information
between $D_X$ and $D_{YX}$ exceeds the one between $\tilde{x}$ and $D_X$, we reject the hypothesis $X \to Y$.

Assume now that $c = d$. Then

$$I(D_X : D_{YX}) \stackrel{+}{=} n ,$$

provided that the estimation was correct. As shown at the end of Subsection 3.1 this is already
possible for $\tilde{m} = O(\log n)$, i.e.,

$$I(D_X : \tilde{x}) \in O(\log_2 n) ,$$

which violates ineq. (23). The importance of this example lies in the fact that $I(P(X) : P(Y|X))$
is not well-defined here because $P(X)$ and $P(Y|X)$ both are uncomputable. Nevertheless, $P(X)$
and $P(Y|X)$ have a computable aspect, i.e., the strings $c$ and $d$ characterizing them. Our strategy
is therefore suitable to detect algorithmic dependences between computable features.
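
The "most radical" blurring step described above can be sketched as follows. This is one possible reading of the selection rule, under the assumption that uniformity is achieved by keeping equally many instances of every observed x-value; the function name and the toy data are hypothetical. In line with the requirement in the text, the rule looks only at x (plus random bits) and never at y.

```python
from collections import defaultdict

import numpy as np


def uniform_subsample(x, rng):
    """Select indices so that every observed x-value occurs equally often.

    Only x and a random generator are used; y is never consulted.
    Indices are returned grouped by x-value in sorted (lexicographic) order,
    with a random choice among the instances that share the same x-value.
    """
    by_value = defaultdict(list)
    for index, value in enumerate(x):
        by_value[value].append(index)
    per_value = min(len(indices) for indices in by_value.values())
    selected = []
    for value in sorted(by_value):
        chosen = rng.permutation(by_value[value])[:per_value]
        selected.extend(int(i) for i in chosen)
    return selected


rng = np.random.default_rng(1)
x = ["01", "00", "01", "11", "00", "01"]   # toy x-values (strings)
y = ["a", "b", "c", "d", "e", "f"]         # toy y-values paired with x
idx = uniform_subsample(x, rng)
x_sub = [x[i] for i in idx]
y_sub = [y[i] for i in idx]
print(x_sub, y_sub)
```

The y-subsample is obtained by applying the same indices, so the pairing between each selected x-value and its y-value is preserved.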

It is remarkable that the above scheme is general enough to include also strategies for very small
sample sizes provided that the probability space is large. To describe an extreme case, we consider
again the example with the truncated strings in fig. 6 with the roles of $X$ and $Y$ reversed. Let $Y$
be a random variable whose value is always the constant string $a \in \{0,1\}^n$. Let $P(X|Y)$ be the
mechanism that generates $X$ by truncating either the $\ell$ leftmost digits or the $\ell$ rightmost
digits of $Y$ (each with probability 1/2). We denote these strings by $a_{left}$ and $a_{right}$,
respectively. Assume we have two observations $x_1 = a_{left}$, $y_1 = a$ and $x_2 = a_{right}$,
$y_2 = a$. We define a subsample by selecting only the first observation
$\tilde{x}_1 := x_1 = a_{left}$ and $\tilde{y}_1 := y_1 = a$. Then we define $D_X := x_1, x_2$ and
$D_{YX} := \tilde{y}_1$. We observe that the mutual information between $D_{YX}$ and $D_X$ is $K(a)$,
while the mutual information between $D_X$ and $\tilde{x}$ is only $K(a_{left})$. Given generic choices
of $a$, this violates condition (23) and we reject the causal hypothesis $X \to Y$.

3.4 Plausible Markov kernels in time series 

Time series are interesting examples of causal structures where the time order provides prior
knowledge on the causal direction. Since there is a large number of them available from all scientific
disciplines they can be useful to test causal inference rules on data with known ground truth. Let
us consider the following example of a causal inference problem. Suppose we are given a time series
together with the prior knowledge that it has been generated by a first order Markov process, but the
direction is unknown. Formally, we are given observations $x_1, x_2, x_3, \dots, x_m$ corresponding to
random variables $X_1, X_2, \dots, X_m$ such that the causal structure is either

$$\cdots \to X_1 \to X_2 \to X_3 \to \cdots \qquad (24)$$

or

$$\cdots \leftarrow X_1 \leftarrow X_2 \leftarrow X_3 \cdots \leftarrow X_n \leftarrow \cdots , \qquad (25)$$

where we have extended the series to infinity in both directions.

The question is whether the asymmetry of the joint distribution with respect to time inversion
provides hints on the real time direction. Let us assume now that the graph (24) corresponds to
the true time direction. Then the hope is that $P(X_{j+1}|X_j)$ is simpler, in some reasonable sense,
than $P(X_j|X_{j+1})$. At first glance this seems to be a straightforward extension of the principle
of plausible Markov kernels discussed in Subsection 3.1. However, there is a subtlety with the
justification when we apply our ideas to stationary time series:

Recall that the principle of minimizing the total complexity of all Markov kernels over all
potential causal directions has been derived from the independence of the true Markov kernels
(remarks after Postulate 7). However, the algorithmic independence of $P(X_j|PA_j) = P(X_j|X_{j-1})$
and $P(X_i|PA_i) = P(X_i|X_{i-1})$ fails spectacularly because stationarity implies that these Markov
kernels coincide and represent a causal mechanism that is constant in time. This shows that the
justification of minimizing total complexity breaks down for stationary time series.

The following argument shows that not only the justification breaks down but also the principle
as such: Consider the case where $P(X_j)$ is the unique stationary distribution of the Markov kernel
$P(X_{j+1}|X_j)$. Then we have

$$K(P(X_j|X_{j+1})) \stackrel{+}{\le} K(P(X_j, X_{j+1})) \stackrel{+}{=} K(P(X_{j+1}|X_j)) .$$

Because the forward time conditional uniquely determines the backward time conditional (via
implying the description of the unique stationary marginal) the Kolmogorov complexity of the latter
can exceed the complexity of the former only by a constant term.

We now focus on non-stationary time series. To motivate the general idea we first present an
example described in [31]. Consider a random walk of a particle on $\mathbb{Z}$ starting at
$z \in \mathbb{Z}$. In every time step the probability is $q$ to move one site to the right and
$(1 - q)$ to move to the left. Let $X_j$ with $j = 0, 1, \dots$ be the random variable describing the
position after step $j$. Then we have $P(X_0 = z) = 1$. The forward time conditional reads

$$P(x_{j+1}|x_j) = \begin{cases}
q & \text{for } x_{j+1} = x_j + 1 \\
1 - q & \text{for } x_{j+1} = x_j - 1 \\
0 & \text{otherwise.}
\end{cases}$$

To compute the backward time conditional we first compute $P(X_j)$, which is given by the
distribution of a Bernoulli experiment with $j$ steps. Let $k$ denote the number of right moves, i.e.,
$j - k$ is the number of left moves. With $x_j = k - (j - k) + z = 2k - j + z$ we thus obtain

$$P(x_j) = q^{(j + x_j - z)/2} (1 - q)^{(j - x_j + z)/2} \binom{j}{(j + x_j - z)/2} .$$

Elementary calculations show

$$P(x_j|x_{j+1}) = P(x_{j+1}|x_j) \frac{P(x_j)}{P(x_{j+1})} = \begin{cases}
\frac{(j + x_j - z)/2 + 1}{j + 1} & \text{for } x_j = x_{j+1} - 1 \\
\frac{(j - x_j + z)/2 + 1}{j + 1} & \text{for } x_j = x_{j+1} + 1 \\
0 & \text{otherwise.}
\end{cases}$$

The forward time process is specified by the initial condition $P(X_0)$ (given by $z$) and the
transition probabilities $P(X_j, \dots, X_1|X_0)$ (given by $q$). A priori, these two "objects" are
mutually unrelated, i.e.,

$$K(P(X_0), P(X_j, X_{j-1}, \dots, X_1|X_0)) \stackrel{+}{=}
K(P(X_0)) + K(P(X_j, X_{j-1}, \dots, X_1|X_0)) \stackrel{+}{=} K(z) + K(q) .$$

On the other hand, the description of $P(X_j)$ (the "initial condition" of the backward time process)
alone already requires the specification of both $z$ and $q$. The description of the "transition rule"
$P(X_1, \dots, X_{j-1}|X_j)$ refers only to $z$. We thus have

$$K(P(X_j)) + K(P(X_1, X_2, \dots, X_{j-1}|X_j)) \stackrel{+}{=} 2K(z) + K(q) .$$

Hence

$$I(P(X_j) : P(X_0, X_1, \dots, X_{j-1}|X_j)) \stackrel{+}{=} K(z) .$$

The fact that the initial distribution of the hypothetical process

$$X_j \to X_{j-1} \to \cdots \to X_0$$

shares algorithmic information with the transition probabilities makes the hypothesis suspicious.
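
The claim that the backward time conditional depends on $j$ and $z$ but not on $q$ can be checked numerically. The sketch below evaluates Bayes' rule with the binomial marginal for two different values of $q$; the function names are illustrative, and the second branch value $((j - x_j + z)/2 + 1)/(j + 1)$ used in the comparison is obtained from the same elementary calculation.

```python
from math import comb


def marginal(q, j, z, x):
    """P(X_j = x) for a walk started at z: binomial in the number k of right moves."""
    twice_k = j + x - z
    if twice_k % 2 or not 0 <= twice_k <= 2 * j:
        return 0.0
    k = twice_k // 2
    return comb(j, k) * q**k * (1 - q) ** (j - k)


def forward(q, x_next, x):
    """Forward time conditional P(x_{j+1} | x_j)."""
    if x_next == x + 1:
        return q
    if x_next == x - 1:
        return 1 - q
    return 0.0


def backward(q, j, z, x, x_next):
    """Backward time conditional P(x_j | x_{j+1}) via Bayes' rule.

    Assumes that x_next is reachable after j + 1 steps, so the denominator is positive.
    """
    return forward(q, x_next, x) * marginal(q, j, z, x) / marginal(q, j + 1, z, x_next)


j, z = 5, 0
for q in (0.3, 0.8):
    came_from_left = backward(q, j, z, x=1, x_next=2)   # x_j = x_{j+1} - 1
    came_from_right = backward(q, j, z, x=3, x_next=2)  # x_j = x_{j+1} + 1
    print(q, came_from_left, came_from_right)
```

Both values of $q$ yield the same backward probabilities, so a description of the backward "transition rule" needs $z$ but not $q$.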
Resolving time series 

We have seen that the algorithmic dependence between "initial condition" and "transition rule" of 
the backward time process (which would be surprising if it occurred for the forward time process) 
represents an asymmetry of non-stationary time-series with respect to time reflection. We will now 
discuss this asymmetry after resolving the statistical ensemble into individual observations. 

Assume we are given $m$ instances of n-tuples $x_1^{(i)}, \dots, x_n^{(i)}$ with $i = 1, \dots, m$ that
have been i.i.d. sampled from $P(X_1, \dots, X_n)$ and $X_1, \dots, X_n$ are part of a time series that
can be described by a first order stationary Markov process. Our resolution of a statistical ensemble
generated by $X \to Y$ contained a source $S$ and a machine $M$. The source generates x-values and the
machine generates y-values from the input $x$. The algorithmic independence of $S$ and $M$ was
essential for the asymmetry between cause and effect described in Subsection 3.2. For the causal chain

$$\cdots \to X_1 \to X_2 \to X_3 \to \cdots$$

we would therefore have machines $M_j$ generating the $x_j$-value from $x_{j-1}$. However, for
stationary time-series all $M_j$ are the same machine. The causal structure of the resolution of the
statistical ensemble for $m = 2$ is shown in fig. 9, left.

This graph entails no independence constraint that is asymmetric with respect to reversing the
time direction. To see this, recall that two DAGs entail the same set of independences if and only
if they have the same skeleton (i.e. the corresponding undirected graphs coincide) and the same
set of unshielded colliders (v-structures), i.e., substructures $A \to C \leftarrow B$ where $A$ and
$B$ are non-adjacent pp. Fig. 9 has no such v-structure and the skeleton is obviously symmetric with
respect to time-inversion.

The initial part is, however, asymmetric (in agreement with the asymmetries entailed by fig. El 
left) and we have 

This is just the finite-sample analogue of the statement that the initial distribution $P(X_0)$ and the 
transition rule $P(X_j | X_{j-1})$ are algorithmically independent. 

4 Decidable modifications of the inference rule 

To use the algorithmic Markov condition in practical applications we have to replace it with com- 
putable notions of complexity. The following two subsections discuss two different directions along 
which practical inference rules can be developed. 

4.1 Causal inference using symmetry constraints 

We have seen that the algorithmic causal Markov condition implies that the sum of the 
Kolmogorov complexities of the Markov kernels must be minimized over all possible causal graphs. 
In practical applications, it is natural to replace the minimization of Kolmogorov complexity with 
a decidable simplicity criterion even though this makes the relation to the theory developed so 
far rather vague. In this subsection we will describe an empirically decidable inference rule and 
show that the relation to Kolmogorov complexity of conditionals is closer than it may seem at first 
glance. 

Figure 9: Left: causal graph of a time series. The value $x_i^{(j)}$ corresponds to the jth instance at time i. Right: 
the initial part of the time-series is asymmetric with respect to time-inversion. 

Figure 10: Two probability distributions P(X) (left) and P(Y) (right) on the set {1, . . . , 120}, both having 4 
peaks at the positions $n_1, \dots, n_4$, but the peaks in P(X) are well-localized and those of P(Y) are smeared out 
by a random walk. 

Moreover, the example below shows a scenario where the causal hypothesis X → Y can already 
be preferred to Y → X by comparing only the marginal distributions P(X) and P(Y) and observing 
that a simple conditional P(Y|X) leads from the former to the latter but no simple conditional 
leads into the opposite direction. The example will furthermore show why the identification of 
causal directions is often easier for probabilistic causal relations than for deterministic ones, a point 
that has also been made by Pearl [1] in a different context. 

Consider the discrete probability space {1, . . . , N} and two distributions P(X), P(Y) like 
the ones depicted in fig. 10 for N = 120. The marginal P(X) consists of k sharp peaks of equal 
height at positions $n_1, \dots, n_k$ and P(Y) also has k modes centered at the same positions, but 
with greater width. We assume that P(Y) can be obtained from P(X) by repeatedly applying 
a doubly stochastic matrix $A = (a_{ij})_{i,j=1,\dots,N}$ with $a_{ii} = 1 - 2p$ for $p \in (0, 1)$ and $a_{ij} = p$ for 
$i = j \pm 1 \pmod{N}$. The stochastic map A thus defines a random walk and we have by assumption 

$$P(Y) = A^m P(X)$$

for some $m \in \mathbb{N}$. 

Now we ask which causal hypothesis is more likely: (1) P(Y) has been obtained from P(X) by 
some stochastic map M. (2) P(X) has been obtained from P(Y) by some stochastic map M. Our 
assumptions already contain an example M that corresponds to the first hypothesis ($M := A^m$). 
Clearly, there also exist maps M for hypothesis (2). One example would be 

$$M := [P(X), P(X), \dots, P(X)], \qquad (26)$$

i.e. M has the probability vector P(X) in every column. 
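
The two hypotheses can be made concrete numerically. The Python sketch below is an illustration under assumptions not fixed by the text: the peak positions, p = 0.25 (chosen at most 1/2 so that all entries of A are non-negative), m = 20, and 0-based indices are all hypothetical choices. It builds the random walk matrix A, the smeared distribution P(Y) = A^m P(X), and the backward map of eq. (26).

```python
import numpy as np

def random_walk_matrix(N, p):
    """Doubly stochastic matrix: stay with prob. 1-2p, step +-1 (mod N) with prob. p each."""
    A = np.zeros((N, N))
    for j in range(N):
        A[j, j] = 1 - 2 * p
        A[(j + 1) % N, j] = p
        A[(j - 1) % N, j] = p
    return A

def entropy_bits(P):
    """Shannon entropy in bits, ignoring zero entries."""
    P = P[P > 0]
    return float(-(P * np.log2(P)).sum())

N, p, m = 120, 0.25, 20
peaks = [10, 45, 70, 100]            # hypothetical peak positions n_1, ..., n_4
P_X = np.zeros(N)
P_X[peaks] = 1 / len(peaks)

A = random_walk_matrix(N, p)
P_Y = np.linalg.matrix_power(A, m) @ P_X   # forward hypothesis: M = A^m

# Backward map of eq. (26): every column equals P(X), so M_back knows the peaks.
M_back = np.tile(P_X[:, None], (1, N))

print(entropy_bits(P_X), entropy_bits(P_Y))
print(np.allclose(M_back @ P_Y, P_X))
```

The forward map is specified by p and m alone, whereas the backward map has to store the peak positions explicitly, which is the asymmetry the complexity argument exploits.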

To describe in which sense X → Y is the simpler hypothesis we observe that M in eq. (26) 
already contains the description of the positions $n_1, \dots, n_k$ whereas $M = A^m$ is rather simple. The 
Kolmogorov complexity of M as chosen above is for a generic choice of the positions $n_1, \dots, n_k$ 
given by 

$$K(M) \stackrel{+}{=} K(P(Y)) \stackrel{+}{=} \log \binom{N}{k},$$

where $\stackrel{+}{=}$ denotes equality up to a term that does not depend on N. This is because different 
locations $n_1, \dots, n_k$ of the original peaks lead to different distributions P(Y) and, conversely, every 
such P(Y) is uniquely defined by describing the positions of the corresponding sharp peaks and M. 

However, we want to prove that also other choices of M necessarily have high values of 
Kolmogorov complexity. To this end, we define a family of $\binom{N}{k}$ probability distributions $P_j(X)$ given 
by equally high peaks at the positions $n_1^j, \dots, n_k^j$ and accordingly the smoothed probability 
distributions $P_j(Y)$. We first need the following result. 

Lemma 10 (average complexity of stochastic maps) 

Let $(Q_j(X))_{j=1,\dots,\ell}$ and $(Q_j(Y))_{j=1,\dots,\ell}$ be two families of marginal distributions of X and Y, 
respectively. 

Moreover, let $(A_j)_{j=1,\dots,\ell}$ be a family of not necessarily different stochastic matrices with 
$A_j Q_j(Y) = Q_j(X)$. Then 

$$\frac{1}{\ell} \sum_{j=1}^{\ell} K(A_j) \geq I(X : J) - I(Y : J),$$

where the information that X contains about the index j is given by 

$$I(X : J) := H\left( \frac{1}{\ell} \sum_j Q_j(X) \right) - \frac{1}{\ell} \sum_j H(Q_j(X)),$$

J denotes the random variable with values j. Here, H(.) denotes the Shannon entropy and I(Y : J) 
is computed in a similar way as I(X : J) using $Q_j(Y)$ instead of $Q_j(X)$. 

Proof: The idea is to show that we need at least $2^{\Delta}$ different stochastic matrices to achieve that 
the information I(X : J) exceeds I(Y : J) by the amount $\Delta$. Using a standard argument rephrased 
below, the average complexity is therefore at least $\Delta$. 

Assume, for instance, that all $A_j$ coincide. Then the usual data processing inequality [18] shows 
that applying the same matrix to the different distributions can never increase the information 
on the index j, i.e., $I(X : J) \leq I(Y : J)$. To derive the lower bound on the number of different 
matrices required we define a partition of $\{1, \dots, \ell\}$ into d sets $S_1, \dots, S_d$ for which the $A_j$ coincide. 
In other words, we have $A_j = B_r$ if $j \in S_r$ and the matrices $B_1, \dots, B_d$ are chosen appropriately. 
We define a random variable R whose value r indicates that j lies in the rth equivalence class. The 
above "data processing argument" implies 

$$I(X : J | R) \leq I(Y : J | R). \qquad (27)$$

Furthermore, we have 

$$I(X : R) \leq I(Y : R) + \log_2 d. \qquad (28)$$

This is because both I(X : R) and I(Y : R) cannot exceed $\log_2 d$ because d is the number of values 
R can attain. Then we have: 

$$
\begin{aligned}
I(X : J) = I(X : J, R) &= I(X : R) + I(X : J | R) \\
&\leq I(Y : R) + \log_2 d + I(Y : J | R) \\
&= \log_2 d + I(Y : J).
\end{aligned}
$$
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The first equality follows because R contains no additional information on X (when J is known) 
since it describes only from which equivalence class j is taken. The second equality is a general rule 
for mutual information [18]. The inequality combines ineqs. (27) and (28). The last equality follows 
similarly to the equalities in the first line. This shows that we need at least $2^{\Delta}$ different matrices 
with $\Delta = \lceil I(X : J) - I(Y : J) \rceil$. We have 

$$2^{-\frac{1}{d} \sum_{j=1}^{d} K(B_j)} \leq \frac{1}{d} \sum_{j=1}^{d} 2^{-K(B_j)} \leq \frac{1}{d},$$

where the first inequality holds because the exponential function is convex and the second is 
entailed by Kraft's inequality. This yields 

$$\frac{1}{d} \sum_{j=1}^{d} K(B_j) \geq \log_2 d,$$

completing the proof. □ 
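
The data processing step of the proof can be illustrated numerically. The following Python sketch uses randomly drawn families and matrices, all of which are assumptions made for this example. It checks that one shared stochastic matrix never increases the information about the index j, and that even with one matrix per index the gain is bounded by log_2 of the number of matrices.

```python
import numpy as np

def entropy_bits(P):
    """Shannon entropy in bits, ignoring zero entries."""
    P = P[P > 0]
    return float(-(P * np.log2(P)).sum())

def info_about_index(family):
    """I(. : J) = H(mean_j Q_j) - mean_j H(Q_j) for a uniformly distributed index J."""
    family = np.asarray(family)
    return entropy_bits(family.mean(axis=0)) - float(np.mean([entropy_bits(Q) for Q in family]))

rng = np.random.default_rng(0)
N, ell = 6, 5
Q_Y = rng.dirichlet(np.ones(N), size=ell)     # rows: hypothetical distributions Q_j(Y)

# Case 1: all A_j coincide with one column-stochastic matrix A.
A = rng.dirichlet(np.ones(N), size=N).T       # columns sum to one
Q_X_same = Q_Y @ A.T                          # row j equals A Q_j(Y)
gap_same = info_about_index(Q_X_same) - info_about_index(Q_Y)

# Case 2: ell different matrices, A_j mapping every input to a point mass at j.
Q_X_distinct = np.eye(N)[:ell]
gap_distinct = info_about_index(Q_X_distinct) - info_about_index(Q_Y)

print(gap_same, gap_distinct, np.log2(ell))
```

In the second case each matrix has to encode its own target position, which is why the information gain is paid for by the number of distinct matrices.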

To apply Lemma 10 to the above example we define families of $\ell := \binom{N}{k}$ distributions $P_j(X)$ 
having their peaks at the positions $n_1^j, \dots, n_k^j$ and also their smoothed versions $P_j(Y)$. Mixing 
all probability distributions will generate the entropy log N for $P_j(X)$ because we then obtain the 
uniform distribution. Since we have assumed that $P_j(Y)$ can be obtained from $P_j(X)$ by a doubly 
stochastic map, mixing all $P_j(Y)$ also yields the uniform distribution. Hence the difference between 
I(X : J) and I(Y : J) is simply given by the average entropy difference 

$$\Delta H := \frac{1}{\ell} \sum_{j=1}^{\ell} \left( H(P_j(Y)) - H(P_j(X)) \right).$$

The Kolmogorov complexity required to map $P_j(Y)$ to $P_j(X)$ is thus, on average over all j, at least 
the entropy generated by the doubly stochastic random walk. Hence we have shown that a typical 
example of two distributions with peaks at arbitrary positions $n_1, \dots, n_k$ needs a process M whose 
Kolmogorov complexity is at least the entropy difference. 
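
The identity between the information gap and the average entropy difference can be verified by brute force for small parameters. The Python sketch below enumerates all peak configurations; the values N = 8, k = 2, p = 0.2 and m = 3 are arbitrary assumptions chosen to keep the enumeration small.

```python
from itertools import combinations
from math import comb
import numpy as np

def random_walk_matrix(N, p):
    """Doubly stochastic matrix: stay with prob. 1-2p, step +-1 (mod N) with prob. p each."""
    A = np.zeros((N, N))
    for j in range(N):
        A[j, j] = 1 - 2 * p
        A[(j + 1) % N, j] = p
        A[(j - 1) % N, j] = p
    return A

def entropy_bits(P):
    P = P[P > 0]
    return float(-(P * np.log2(P)).sum())

def info_about_index(family):
    """I(. : J) for a uniformly distributed index J over the rows of `family`."""
    return entropy_bits(family.mean(axis=0)) - float(np.mean([entropy_bits(Q) for Q in family]))

N, k, p, m = 8, 2, 0.2, 3
A_m = np.linalg.matrix_power(random_walk_matrix(N, p), m)

# One distribution P_j(X) per choice of k peak positions out of N.
family_X = np.zeros((comb(N, k), N))
for j, peaks in enumerate(combinations(range(N), k)):
    family_X[j, list(peaks)] = 1 / k
family_Y = family_X @ A_m.T                   # row j equals A^m P_j(X)

gap = info_about_index(family_X) - info_about_index(family_Y)
delta_H = float(np.mean([entropy_bits(y) - entropy_bits(x)
                         for x, y in zip(family_X, family_Y)]))
print(gap, delta_H)
```

Both mixtures are uniform here, so the two printed numbers agree up to rounding, as the argument in the text requires.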

One may ask why to consider distributions with several peaks even though the above result 
will formally also apply to distributions Pj(X) and Pj(Y) with only one peak. The problem is 
that the statement "two distributions have a peak at the same position" does not necessarily make 
sense for empirical data. This is because the definition of variables is often chosen such that the 
distribution becomes centralized. The statement that multiple peaks occur on seemingly random 
positions seems therefore more sensible than the statement that one peak has been observed at a 
random position. 

Above, we have used a finite number of discrete bins in order to keep the problem as 
combinatorial as possible. In reality, we would rather expect a scenario like the one in fig. 11 
where two distributions on $\mathbb{R}$ have the same peaks, but the peaks in the one distribution have been 
smoothed, for example by an additive Gaussian noise. 

As above, we would rather assume that X is the cause of Y than vice versa since the smoothing 
process is simpler than any process that leads in the opposite direction. We emphasize that 
denoising is an operation that cannot be represented by a stochastic matrix; it is a linear operation that 
can be applied to the whole data set in order to reconstruct the original peaks. The statement is 
thus that no simple stochastic process leads in the opposite direction. To further discuss the 
rationale behind this way of reasoning we introduce another notion of simplicity that does not refer to 
Kolmogorov complexity. To this end, we introduce the notion of translation covariant conditional 
probabilities: 
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Figure 11: Two probability distributions P(X) (solid) and P(Y) (dashed) where P(Y) can be obtained from 
P(X) by convolution with a Gaussian distribution 

Definition 7 (translation covariance) 

Let X, Y be two real-valued random variables. A conditional distribution P(Y|X) with density 
P(y|x) is called translation covariant if 

$$P(y | x + t) = P(y - t | x) \quad \forall t \in \mathbb{R}.$$

Apart from this, we will also need the following well-known concept from statistical estimation 
theory [35]: 

Definition 8 (Fisher information) 

Let p(x) be a continuously differentiable probability density of P(X) on $\mathbb{R}$. Then the Fisher 
information is defined as 

$$F(P(X)) := \int \left( \frac{d}{dx} \ln p(x) \right)^2 p(x) \, dx.$$

Then we have the following Lemma (see Lemma 1 in [33], showing the statement in a more 
general setting that involves also quantum stochastic maps): 

Lemma 11 (monotonicity under covariant maps) 

Let P(X, Y) be a joint distribution such that P(Y|X) is translation covariant. Then 

$$F(P(Y)) \leq F(P(X)).$$

The intuition is that F quantifies the degree to which a distribution is non-invariant with respect 
to translations and that no translation covariant process is able to increase this measure. The 
convolution with a Gaussian distribution with non-zero variance decreases the Fisher information. 
Hence there is never a translation covariant stochastic map in backward direction. 
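
The decrease of Fisher information under Gaussian smoothing can be observed numerically. The Python sketch below is an illustration under assumed values: the peak positions, the widths, and the grid are hypothetical. It uses the fact that convolving a Gaussian mixture with Gaussian noise gives a mixture with the same means and summed variances, and approximates F by a Riemann sum on a uniform grid.

```python
import numpy as np

def gaussian_mixture_density(x, means, sigma):
    """Equal-weight mixture of Gaussians with common standard deviation sigma."""
    comps = [np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
             for mu in means]
    return np.mean(comps, axis=0)

def fisher_information(x, density):
    """Approximate F = int (d/dx ln p)^2 p dx = int (p')^2 / p dx on a uniform grid."""
    dx = x[1] - x[0]
    dp = np.gradient(density, dx)
    mask = density > 1e-12                    # skip numerically empty tails
    return float(np.sum(dp[mask] ** 2 / density[mask]) * dx)

x = np.linspace(-20.0, 20.0, 40001)
means = [-6.0, 0.0, 5.0]                      # hypothetical peak positions
sigma_x, sigma_noise = 0.5, 1.0

p_X = gaussian_mixture_density(x, means, sigma_x)
# Adding independent Gaussian noise: variances add, means stay the same.
p_Y = gaussian_mixture_density(x, means, np.sqrt(sigma_x ** 2 + sigma_noise ** 2))

F_X = fisher_information(x, p_X)
F_Y = fisher_information(x, p_Y)
print(F_X, F_Y)
```

For a single Gaussian the Fisher information is the inverse variance, which gives a quick sanity check of the numerical routine.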

The argument above can easily be generalized in two respects. First, the argument works also 
with other quantities that are monotonic with respect to translation covariant stochastic maps. 
Second, we can also consider more general symmetries: 
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Definition 9 (general group covariance) 

Let X, Y be random variables with equal range S. Let G be a group of bijections g : S → S, and 
let $X^g$ and $Y^g$ denote the random variables obtained by permuting the outcomes of the corresponding 
random experiment according to g. Then we call a conditional P(Y|X) G-covariant if 

$$P(Y^g | X) = P(Y | X^{g^{-1}}) \quad \forall g \in G.$$

It is easy to see that covariant stochastic maps define a quasi-order of probability distributions 
on S by defining $P \geq \tilde{P}$ if there is a covariant stochastic map A such that $A P = \tilde{P}$. This is 
transitive since the concatenation of covariant maps is again covariant. 

If a G-invariant measure $\mu$ ("Haar measure") exists on G we can easily define an information 
theoretic quantity that measures the degree of non-invariance with respect to G: 

Definition 10 (reference information) 

Let P(X) be a distribution on S and G be a group of bijections on S with Haar measure $\mu$. Then 
the reference information is given by: 

$$I_G := H\left( \int P(X^g) \, d\mu(g) \right) - H(P(X)).$$



The name "reference information" has been used in [34] in a slightly different context where 
this information occurred as the value of a physical system to communicate a reference system (e.g. 
spatial or temporal) where G describes, for instance, translations in time or space. The quantity 
$I_G$ can easily be interpreted as mutual information I(X : Z) if we introduce a G-valued random 
variable Z whose values indicate which transformation g has been applied. One can thus show that 
$I_G$ is non-increasing with respect to every G-covariant map [34], [35]. 

The following model describes a link between inferring causal directions by preferring covariant 
conditionals and preferring directions with algorithmically independent Markov kernels. Consider 
first the probability space S := {0, 1}. We define the group $G := \mathbb{Z}_2 = (\{0, 1\}, \oplus)$, i.e., the additive 
group of integers modulo 2, acting on S as bit-flips or identity. For any distribution P on {0, 1}, 
the reference information $I_G(P)$ then measures the asymmetry with respect to bit-flips. For two 
distributions $P$ and $\tilde{P}$ we can have the situation that a G-symmetric stochastic matrix leads from 
$P$ to $\tilde{P}$, but only asymmetric stochastic maps convert $\tilde{P}$ into $P$. Now we extend this idea to the 
group $\mathbb{Z}_2^n$ acting on strings of length n by independent bit-flips. Assume we have a distribution 
on $\{0, 1\}^n$ of the form $P_c$ in Definition 5 for some string c and generate the distribution $\tilde{P}_c$ by 
applying M to $P_c$ where 

$$M := \bigotimes_{j=1}^{n} \begin{pmatrix} 1 - \epsilon_j & \epsilon_j \\ \epsilon_j & 1 - \epsilon_j \end{pmatrix}$$

with $\epsilon_j \in (0, 1)$. Then M is G-symmetric, but no G-symmetric process leads backwards. This is 
because every stochastic map leading backwards would be asymmetric in a way that encodes c, i.e., the map 
would have "to know" c because M has destroyed some amount of information about it. 
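
For the single-bit case the reference information is easy to compute explicitly. The Python sketch below is an illustration only: the distribution (0.9, 0.1) and the flip probability 0.2 are assumed values. It evaluates the asymmetry under bit-flips before and after a binary symmetric channel and checks that this channel commutes with the bit-flip.

```python
import numpy as np

def entropy_bits(P):
    """Shannon entropy in bits, ignoring zero entries."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]
    return float(-(P * np.log2(P)).sum())

def reference_information_z2(P):
    """H(average of P over {identity, bit-flip}) - H(P) for a distribution on {0, 1}."""
    P = np.asarray(P, dtype=float)
    mixed = 0.5 * (P + P[::-1])
    return entropy_bits(mixed) - entropy_bits(P)

def binary_symmetric_channel(eps):
    """Column-stochastic matrix that flips the bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

P = np.array([0.9, 0.1])                      # hypothetical asymmetric distribution
M = binary_symmetric_channel(0.2)
P_tilde = M @ P

print(reference_information_z2(P), reference_information_z2(P_tilde))
# Covariance: flipping the input and then applying M equals applying M and then flipping.
print(np.allclose(M @ P[::-1], (M @ P)[::-1]))
```

The symmetric channel lowers the reference information, so a map that restores P from the smoothed distribution cannot itself be symmetric under bit-flips.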



4.2 Resource-bounded complexity 

The problem that the presence or absence of mutual information is undecidable (when defined via 
Kolmogorov complexities) resembles the situation in statistics in some respects and differs from it 
in others. Let us first focus on the analogy. Given two real-valued random variables X, Y, it is 
impossible to show by finite sampling that they are statistically independent. $X \perp Y$ is equivalent to 
E(f(X)g(Y)) = E(f(X))E(g(Y)) for every pair (f, g) of measurable functions. If we observe significant 
correlations between f(X) and g(Y) for some previously defined pair, it is well-justified to reject independence. 
The same holds if such correlations are detected for f, g in some previously defined, sufficiently 
small set of functions (cf. [35]). However, if this is not the case, we can never be sure that there 
is not some pair of arbitrarily complex functions f, g that are correlated with respect to the true 
distribution. Likewise, if we have two strings x, y and find no simple program that computes x 
from y this does not mean that there is no such rule. Hence, we also have the statement that 
there can be an algorithmic dependence even though we do not find it. 

However, the difference to the statistical situation is the following. Given that we have found 
functions f, g yielding correlations it is only a matter of the statistical significance level whether this 
is sufficient to reject independence. For algorithmic dependences, we do not even have a decidable 
criterion to reject independence. Given that we have found a simple program that computes x from 
y, it still may be true that I(x : y) is small because there may also be a simple rule to generate x 
(which would imply I(x : y) ≈ 0) that we were not able to find. This shows that we can neither 
show dependence nor independence. 

One possible answer to these problems is that Kolmogorov complexity is only an idealization of 
empirically decidable quantities. Developing this idealization only aims at providing hints in which 
directions we have to develop practical inference rules. Compression algorithms have already been 
developed that are intended to approximate, for instance, the algorithmic information of genetic 
sequences [37], [38]. Chen et al. [35] constructed a "conditional compression scheme" to approximate 
conditional Kolmogorov complexity and applied it to the estimation of the algorithmic mutual 
information between two genetic sequences. To evaluate to which extent methods of this kind can 
be used for causal inference using the algorithmic Markov condition is an interesting subject of 
further research. 
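
The idea of approximating algorithmic mutual information by a real compressor can be sketched in a few lines. The Python example below uses zlib as a crude stand-in for Kolmogorov complexity; this is an assumption made for illustration and not the conditional compression scheme of Chen et al. It estimates I(x : y) as C(x) + C(y) - C(xy), where C denotes compressed length in bytes and the test strings are synthetic.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (upper bound proxy for K)."""
    return len(zlib.compress(data, 9))

def approx_mutual_information(x: bytes, y: bytes) -> int:
    """Compression-based proxy for I(x:y) = K(x) + K(y) - K(x,y), in bytes."""
    return compressed_length(x) + compressed_length(y) - compressed_length(x + y)

rng = random.Random(0)
x = bytes(rng.getrandbits(8) for _ in range(2000))   # incompressible-looking string
z = bytes(rng.getrandbits(8) for _ in range(2000))   # generated independently of x

# y is a noisy copy of x: every 100th byte is inverted.
noisy = bytearray(x)
for i in range(0, len(noisy), 100):
    noisy[i] ^= 0xFF
y = bytes(noisy)

print(approx_mutual_information(x, y))   # large: y is mostly predictable from x
print(approx_mutual_information(x, z))   # small: no shared structure found
```

As the surrounding discussion stresses, a small estimate only means that this particular compressor found no shared structure; it does not establish algorithmic independence.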

It is also noteworthy that there is a theory on resource-bounded description complexity [19] 
where compressions of x are only allowed if the decompression can be performed within a previously 
defined number of computation steps and on a tape of previously defined length. An important 
advantage of resource-bounded complexity is that it is computable. The disadvantage, on the other 
hand, is that the mathematical theory is more difficult. Parts of this paper have been developed 
by converting statements on statistical dependences into their algorithmic counterpart. The strong 
analogy between statistical and algorithmic mutual information occurs only for complexity with 
unbounded resources. For instance, the symmetry I(x : y) = I(y : x) breaks down when replacing 
Kolmogorov complexity with resource-bounded versions [19]. Nevertheless, to develop a theory of 
inferred causation using resource-bounded complexity could be a challenge for the future. There are 
several reasons to believe that taking into account computational complexity can provide additional 
hints on the causal structure: 

Bennett [3J2 20] EI] , for instance, has argued that the logical depth of an object echoes in some 
sense its history. The former is, roughly speaking, defined as follows. Let x be a string that 
describes the object and s be its shortest description. Then the logical depth of x is the number of 
time steps that a parallel computing device requires to compute x from s. According to Bennett, 
large logical depth indicates that the object has been created by a process that consisted of many 
non-trivial steps. This would mean that there also is some causal information that follows from 
the time-resources required to compute a string from its shortest description. 

The time-resources required to compute one observation from the other also play a role in the 
discussion of causal inference rules in [31]. The paper presents a model where the conditional 

P(effect|cause) 

can be efficiently computed, while computing 

P(cause|effect) 

is NP-hard. This suggests that the computation time required to use information of the cause for 
the description of the effect can be different from the time needed to obtain information on the 
cause from the effect. However, the goal of the present paper was to describe asymmetries between 
cause and effect that even occur when computational complexity is ignored. 

5 Conclusions 

We have shown that our algorithmic causal Markov condition links algorithmic dependences 
between single observations with the underlying causal structure in the same way as the 
statistical causal Markov condition links statistical dependences among random variables 
to the causal structure. The algorithmic Markov condition has implications on different levels: 

(1) In conventional causal inference one can drop the assumption that observations 

$(x_1^{(i)}, \dots, x_n^{(i)})$

have been generated by independent sampling from a constant joint distribution 

$P(X_1, \dots, X_n)$

of n random variables $X_1, \dots, X_n$. Algorithmic information theory thus replaces statistical causal 
inference with a probability-free formulation. 

(2) Causal relations among individual objects can be inferred provided their shortest descriptions 
are sufficiently complex. 

(3) New statistical causal inference rules follow because causal hypotheses are suspicious if the 
corresponding Markov kernels are algorithmically dependent. 

Since algorithmic mutual information inherits the uncomputability of Kolmogorov complexity, 
we have presented decidable inference rules that are motivated by the uncomputable 
idealization. 
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